ABSTRACT

In  this  paper  we  establish  strong  consistency  and  asymptotic  normality  of 
the  Robbins-Monro  (RM)  recursive  m-estimator  under  conditions  allowing  for 
weakly  dependent  data,  based  on  the  general  results  of  Kushner  and  Clark 
(1978)  and  Kushner  and  Hwang  (1979).  We  specialize  further  in  order  to 
establish  the  asymptotic  properties  of  three  implementations  of  the  RM 
procedures  for  the  nonlinear  regression  model.  The  nonlinear  regression 
results  are  also  applied  to  the  estimation  of  a  feedforward  neural  network 
model.  Our  results  provide  readily  verifiable  conditions  and  generalize  many 
previous  results  in  nonlinear  regression  and  neural  network  learning. 
1.     INTRODUCTION

To locate the zero $\theta^*$ of an unknown function $\Psi(\theta)$, Robbins and Monro (1951) introduced
the stochastic approximation (SA) method. The Robbins-Monro (RM) algorithm recursively
approximates $\theta^*$ by

$\theta_{t+1} = \theta_t + a_t \psi(Z_t, \theta_t)$,     $t = 1, 2, \ldots$,     (1.1)

where $a_t$ is a "learning rate" tending to zero, and $\psi(Z_t, \theta)$ is a measurement of $\Psi(\theta)$ at time $t$,
influenced by random variables $Z_t$. When $\Psi(\theta) = E(\psi(Z_t, \theta))$ this method yields a recursive
implementation of the method of m-estimation of Huber (1964). In particular, the method can be
used to estimate recursively the parameters of nonlinear regression models, such as those arising
in certain neural network applications.
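As a minimal illustration of recursion (1.1), the following Python sketch (not part of the original paper) runs the RM update for a scalar location parameter. The function names `robbins_monro`, `psi_mean` and `psi_huber`, the zero-based time index, and the clipping constant `c` are assumptions made for the example. With $\psi(z, \theta) = z - \theta$ and $a_t = (t+1)^{-1}$ the recursion reproduces the running sample mean; the clipped measurement corresponds to a Huber-type location m-estimator.

```python
def robbins_monro(psi, theta0, data, rate=lambda t: 1.0 / (t + 1)):
    """Iterate theta_{t+1} = theta_t + a_t * psi(z_t, theta_t), with t counted from 0."""
    theta = theta0
    for t, z in enumerate(data):
        theta = theta + rate(t) * psi(z, theta)
    return theta


def psi_mean(z, theta):
    """Measurement whose expectation vanishes at the mean of Z."""
    return z - theta


def psi_huber(z, theta, c=1.5):
    """Residual clipped to [-c, c], as in a Huber-type location m-estimator."""
    return max(-c, min(c, z - theta))


data = [2.0, 4.0, 9.0, 1.0, 4.0]
theta_mean = robbins_monro(psi_mean, theta0=100.0, data=data)
theta_huber = robbins_monro(psi_huber, theta0=0.0, data=data)
print(theta_mean, theta_huber)
```

Because $a_0 = 1$ in this indexing, the starting value is overwritten by the first observation in the mean case, which is why the arbitrary choice `theta0=100.0` has no effect there.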

The  RM  algorithm  has  two  significant  advantages:  (1)  its  recursive  nature  places  few 
demands  on  computer  resources;  and  (2)  in  theory,  just  one  pass  through  a  sufficiently  large  data 
set  can  yield  a  consistent  estimate.  The  RM  algorithm  is  therefore  particularly  appealing  for 
estimating  parameters  of  nonlinear  models  in  large  data  sets. 

Very  general  results  relevant  to  the  convergence  properties  of  the  RM  algorithm  have  been 
given  by  Kushner  and  Clark  (1978)  (KC)  and  Kushner  and  Hwang  (1979)  (KH).  However,  the 
conditions  of  KC/KH  are  not  primitive  and  require  some  effort  to  apply.  The  purpose  of  this 
paper  is  to  bridge  an  existing  gap  between  the  results  of  KC/KH  and  some  interesting  and  fairly 
broad  application  areas.  Our  results  provide  conditions  simpler  and  easier  to  verify  than  those  of 
KC/KH;  for  our  applications,  we  obtain  useful  generalizations  of  results  previously  available. 

Specifically,  we  specialize  the  results  of  KC/KH  to  establish  the  consistency  and 
asymptotic  normality  of  the  Robbins-Monro  (RM)  recursive  m-estimator  under  conditions 
allowing for moderate dependence in the underlying stochastic process $\{Z_t\}$, relying on
mixingale convergence results of McLeish (1975). These results extend recent results of
Englund, Holst and Ruppert (1988). We specialize further in order to establish the properties of
three implementations of the RM procedure applicable to the nonlinear regression model: the
"simple," "quick" and "modified" RM procedures. This permits us to generalize certain results of
Albert and Gardner (1967), Ljung (1977), Ruppert (1983), Ljung and Soderstrom (1983),
Metivier and Priouret (1984) and White (1989). Because the (extended) Kalman filter for a
particular  system  coincides  with  the  modified  RM  procedure,  our  results  rigorously  establish  the 
consistency  and  asymptotic  normality  of  this  extended  Kalman  filter  in  a  setting  somewhat  more 
general  than  previously  available.  Finally,  the  nonlinear  regression  results  are  applied  to  the 
estimation  of  parameters  in  a  leading  neural  network  model,  considerably  generalizing 
previously  available  results  for  network  learning  (e.g.,  White,  1989). 

The  paper  is  organized  as  follows.  In  Section  2  we  provide  conditions  ensuring  the  strong 
consistency  and  asymptotic  normality  of  the  general  RM  m-estimation  algorithm.  In  Section  3 
we  introduce  implementations  of  the  RM  algorithm  suitable  for  use  in  the  nonlinear  regression 
problem  and  provide  conditions  establishing  the  consistency  and  asymptotic  normality  of  these 
methods.  Section  4  contains  the  neural  network  application,  and  Section  5  contains  a  summary 
and  a  discussion  of  directions  for  further  research.  A  mathematical  appendix  contains  the  proofs 
of  all  results. 

2.     ASYMPTOTIC  PROPERTIES  OF  THE  RM  M-ESTIMATOR 

The ordinary differential equation (ODE) method for establishing consistency of recursive
estimators introduced by Ljung (1977) and followed here makes use of certain interpolated
processes. Given a sequence $\{a_t \in \mathbb{R}_+\}$, let $\tau_t = \sum_{i=0}^{t-1} a_i$, $\tau_0 = 0$. Define the piecewise linear
interpolation of $\{\theta_t\}$ with interpolation intervals $\{a_t\}$ as

$\theta^0(\tau) = (\tau_{t+1} - \tau)\,\theta_t / a_t + (\tau - \tau_t)\,\theta_{t+1} / a_t$,     $\tau \in [\tau_t, \tau_{t+1})$,     (2.1)

and the piecewise constant interpolation

$\bar{\theta}^0(\tau) = \theta_t$,     $\tau \in [\tau_t, \tau_{t+1})$.     (2.2)

We also make use of the leftward shifts of $\theta^0(\cdot)$,

$\theta^t(\tau) = \theta^0(\tau + \tau_t)$,     $\tau \ge 0$.     (2.3)

Note that (2.1) defines a continuous process on $[0, \infty)$, while (2.2) is a process on $[0, \infty)$ right
continuous with left limits. $\{\theta^t(\cdot)\}$ is a sequence of continuous processes. Strong consistency is
obtained via the ODE method by showing that $\{\theta^t(\cdot)\}$ has a convergent subsequence satisfying
an ordinary differential equation, $\dot{\theta} = \Psi(\theta)$.
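The interpolated processes can be made concrete with a short Python sketch for scalar iterates. This is only an illustration: the function names and the sample values are assumptions, `thetas` is taken to hold one more element than `rates`, and evaluation is restricted to $\tau$ below the last breakpoint.

```python
from bisect import bisect_right
from itertools import accumulate


def breakpoints(rates):
    """tau_0 = 0 and tau_t = a_0 + ... + a_{t-1}."""
    return [0.0] + list(accumulate(rates))


def theta_linear(tau, thetas, rates):
    """Piecewise linear interpolation (2.1), for tau in [0, tau_n)."""
    taus = breakpoints(rates)
    t = bisect_right(taus, tau) - 1  # index t with tau_t <= tau < tau_{t+1}
    return ((taus[t + 1] - tau) * thetas[t] + (tau - taus[t]) * thetas[t + 1]) / rates[t]


def theta_constant(tau, thetas, rates):
    """Piecewise constant interpolation (2.2)."""
    taus = breakpoints(rates)
    return thetas[bisect_right(taus, tau) - 1]


def theta_shifted(tau, t, thetas, rates):
    """Leftward shift (2.3): the linear interpolation restarted at tau_t."""
    return theta_linear(tau + breakpoints(rates)[t], thetas, rates)


rates = [1.0, 0.5, 0.25]           # a_0, a_1, a_2
thetas = [0.0, 2.0, 3.0, 3.5]      # theta_0, ..., theta_3
taus = breakpoints(rates)
print(theta_linear(1.25, thetas, rates), theta_constant(1.25, thetas, rates))
```

The linear interpolation passes through $\theta_t$ at each breakpoint $\tau_t$, whereas the constant interpolation jumps there; the shifted process simply relabels time so that $\tau_t$ becomes the origin.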

Our  consistency  result  follows  as  a  consequence  of  Theorem  2.4.2  of  KC.  We  make  use  of 
the  following  conditions. 

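ASSUMPTION A.1: $(\Omega, \mathcal{F}, P)$ is a complete probability space on which is defined the
sequence of $\mathcal{F}$-measurable functions $\{Z_t : \Omega \to \mathbb{R}^s,\ t = 1, 2, \ldots\}$, $s \in \mathbb{N} = \{1, 2, \ldots\}$.

ASSUMPTION A.2:

(a) $\psi : \mathbb{R}^s \times \mathbb{R}^k \to \mathbb{R}^k$ is measurable-$\mathcal{B}^{s+k}/\mathcal{B}^k$, where $\mathcal{B}^k$ is the Borel field over
$\mathbb{R}^k$, $k \in \mathbb{N}$.

(b) There exists a compact set $\Theta \subset \mathbb{R}^k$ such that:

(i) there exist functions $b : \Theta \to \mathbb{R}_+$, $h_1 : \mathbb{R}^s \to \mathbb{R}_+$, $h_2 : \mathbb{R}^s \to \mathbb{R}_+$, where $b$ is
continuous on $\Theta$, and $h_1$ and $h_2$ are measurable-$\mathcal{B}^s$, such that for each $(z, \theta)$ in $\mathbb{R}^s \times \Theta$

$|\psi(z, \theta)| \le b(\theta) h_1(z) + h_2(z)$; and

(ii) there exist functions $\rho_1 : \mathbb{R}_+ \to \mathbb{R}_+$ and $h_3 : \mathbb{R}^s \to \mathbb{R}_+$ such that $\rho_1(u) \to 0$ as
$u \to 0$, $h_3$ is measurable-$\mathcal{B}^s$, and for each $(z, \theta_1, \theta_2)$ in $\mathbb{R}^s \times \Theta \times \Theta$

$|\psi(z, \theta_1) - \psi(z, \theta_2)| \le \rho_1(|\theta_1 - \theta_2|)\, h_3(z)$,

where $|\cdot|$ denotes the Euclidean norm.

ASSUMPTION A.3: $E|\psi(Z_t, \theta)| < \infty$ for each $\theta$ in $\Theta$, and there exists a function $\Psi : \Theta \to \mathbb{R}^k$
continuous on $\Theta$ such that for each $\theta$ in $\Theta$, $\Psi(\theta) = \lim_{t \to \infty} E\,\psi(Z_t, \theta)$.

ASSUMPTION A.4: $\{a_t\}$ is a sequence of positive real numbers such that $a_t \to 0$ as $t \to \infty$ and
$\sum_{t=0}^{n} a_t \to \infty$ as $n \to \infty$.

ASSUMPTION A.5:

(a) For each $\theta$ in $\Theta$, $\sum_{t=0}^{n} a_t [\psi(Z_t, \theta) - E\,\psi(Z_t, \theta)]$ converges a.s.-$P$ as $n \to \infty$; and

(b) For $j = 1, 2, 3$, there exist bounded non-stochastic sequences $\{\eta_{jt}\}$ such that
$\sum_{t=0}^{n} a_t [h_j(Z_t) - \eta_{jt}]$ converges a.s.-$P$ as $n \to \infty$.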

Assumption A.1 introduces the data generating process, and Assumption A.2 imposes some
suitable and relatively mild restrictions on the growth and smoothness properties of the
measurement function $\psi$. Assumption A.3 is a mild asymptotic mean stationarity requirement.
In Assumption A.4, the condition $a_t \to 0$ ensures that the effect of error adjustment eventually
vanishes; the condition $\sum_{t=0}^{n} a_t \to \infty$ allows the adjustment to continue for an arbitrarily long
time, so that the eventual convergence of (1.1) is always plausible.

Assumption A.5 imposes mild convergence conditions on the processes depending on $Z_t$.
Below  we  consider  more  primitive  mixingale  conditions  that  ensure  the  validity  of  this 
assumption. 
Let $\pi : \mathbb{R}^k \to \Theta$ be a measurable projection function (for $\theta \in \Theta$, $\pi(\theta) = \theta$). We then have
that for all RM estimates $\theta_t$, $\pi(\theta_t) \in \Theta$. In what follows, $\theta_t$ will also denote the projected
process for the sake of notational convenience, and the interpolated processes are understood to
be those of the projected process. We write $\theta_t \to \Theta^*$ as $t \to \infty$ if $\inf_{\theta \in \Theta^*} |\theta_t - \theta| \to 0$ as $t \to \infty$.
We have

THEOREM 2.1: Suppose that Assumptions A.1-A.5 hold, and let $\{\theta_t\}$ be given by (1.1) with
$\theta_0$ chosen arbitrarily. Then

(a) There exists a $P$-null set $\Omega_0$ such that for $\omega \notin \Omega_0$, $\{\theta^t(\cdot)\}$ is bounded and
equicontinuous on bounded intervals, and $\{\theta^t(\cdot)\}$ has a convergent subsequence whose limit $\theta(\cdot)$
satisfies the ODE $\dot{\theta} = \pi[\Psi(\theta)]$.

Let $\Theta^*$ be the set of locally asymptotically stable (in the sense of Liapunov) equilibria in $\Theta$ for
this ODE with domain of attraction $d(\Theta^*) \subset \mathbb{R}^k$.

(b) If $\Theta \subset d(\Theta^*)$, then $\theta_t \to \Theta^*$ as $t \to \infty$ with probability one (w.p. 1).

(c) If $\Theta$ is not contained in $d(\Theta^*)$ but for $\omega \notin \Omega_0$, $\theta_t$ enters a compact subset of $d(\Theta^*)$
infinitely often, then $\theta_t \to \Theta^*$ as $t \to \infty$ w.p. 1.

(d) Given the conditions in (c), if $\Theta^*$ contains only finitely many points, then
$\theta_t \to \theta^* \in \Theta^*$ as $t \to \infty$ w.p. 1.     □

Theorem 2.1(a) indicates that the path of the RM estimates behaves like a solution
trajectory of a corresponding ODE asymptotically. We note that a zero point of $\Psi(\theta)$ need not be
asymptotically stable. A sufficient condition to ensure asymptotic stability of an equilibrium $\theta^*$
is that all the eigenvalues of the matrix $\nabla_\theta \Psi(\theta^*)$ have negative real parts (e.g., Sydsaeter, 1981, p.
362). If $\theta_t$ belongs to the domain of attraction of $\theta^*$ in $\Theta^*$ infinitely often, we can extract a
convergent subsequence of $\theta^t(\cdot)$. Denote the limit by $\theta(\cdot)$. Clearly, $\theta(0)$ also belongs to the
same domain of attraction, and by asymptotic stability of $\theta^*$, $\theta(\tau) \to \theta^*$ as $\tau \to \infty$. Otherwise, the
path of $\theta_t$ may follow a solution trajectory that is either divergent or cycling. Theorem 2.1(b)
and 2.1(c) are thus analogous to the results of Ljung (1977, Theorem 1). Theorem 2.1(d) further
indicates that cycling between two asymptotically stable equilibria is not possible.

This result generalizes classical results (e.g., Blum, 1954) in several respects. First, $Z_t$ is
not required to enter the function $\psi$ additively. Second, the learning rate $a_t$ is not required to be
square summable. Most importantly, general behavior for $Z_t$ is allowed, provided that
Assumption  A.5  holds.  As  examples,  KC  consider  martingale  difference  sequences  and  moving 
average  processes. 

A general class of stochastic processes satisfying the convergence conditions of
Assumption A.5 is the class of mixingales (McLeish, 1975). Let $\|\cdot\|_p$ denote the $L_p$-norm,
$\|X\|_p \equiv (E|X|^p)^{1/p}$. When $\|X\|_p < \infty$ we write $X \in L_p(P)$. If $X$ is a matrix or vector, $X \in L_p(P)$
whenever each element of $X$ belongs to $L_p(P)$. In this case $\|\cdot\|_p$ is as just defined, with $|\cdot|$
denoting the spectral norm induced by the Euclidean norm. We use the following definition.

DEFINITION 2.2: Let $\{X_t\}$ be a sequence of random variables belonging to $L_2(P)$ and let $\{\mathcal{F}^t\}$
be a filtration of $\mathcal{F}$. The sequence $\{X_t, \mathcal{F}^t\}$ is a mixingale process if for sequences of
nonnegative real constants $\{c_t\}$ and $\{\zeta_m\}$, where $\zeta_m \to 0$ as $m \to \infty$, we have
$\|E(X_t \mid \mathcal{F}^{t-m})\|_2 \le c_t \zeta_m$ and $\|X_t - E(X_t \mid \mathcal{F}^{t+m})\|_2 \le c_t \zeta_{m+1}$. $\{X_t\}$ is a mixingale of size $-a$ if
$\zeta_m = O(m^\lambda)$ for some $\lambda < -a$. (We drop explicit reference to the filtration when there is no risk of
confusion.)     □

When $\zeta_m$ satisfies this last condition, we also say that $\zeta_m$ is of size $-a$. Our definition of size is
convenient, but also stronger than that considered by McLeish (1975). As special cases,
mixingale processes include independent sequences, martingale difference sequences, $\phi$-, $\rho$- and
$\alpha$-mixing processes, finite and certain infinite order moving average processes, and sequences of
near  epoch  dependent  functions  of  infinite  histories  of  mixing  processes  (discussed  further  in  the 
next  section).  Mixingales  thus  constitute  a  rather  broad  class  of  dependent  heterogeneous 
processes. 

In our applications, we always assume that the relevant random variables are measurable-$\mathcal{F}^t$,
so that the second mixingale condition holds automatically. This avoids anticipativity of
the  RM  algorithm. 

The  following  conditions  permit  application  of  McLeish's  mixingale  convergence  theorem 
(McLeish,  1975,  Corollary  1.8)  to  verify  the  conditions  of  Assumption  A.5. 

ASSUMPTION A.4': $\{a_t\}$ is a sequence of positive real numbers such that $\sum_{t=0}^{\infty} a_t^2 < \infty$ and
$\sum_{t=0}^{n} a_t \to \infty$ as $n \to \infty$.

ASSUMPTION A.5':

(a) For each $\theta$ in $\Theta$, $\sup_t \|\psi(Z_t, \theta)\|_2 \le \Delta_\theta < \infty$ and $\{\psi(Z_t, \theta) - E\,\psi(Z_t, \theta), \mathcal{F}^t\}$ is a mixingale
of size $-1/2$, where $\mathcal{F}^t = \sigma(Z_1, \ldots, Z_t)$;

(b) For $j = 1, 2, 3$, $\sup_t \|h_j(Z_t)\|_2 \le \Delta < \infty$ and $\{h_j(Z_t) - E\,h_j(Z_t), \mathcal{F}^t\}$ is a mixingale of size
$-1/2$.

Assumption A.4' implies Assumption A.4. Note also that $\sup_t \|\psi(Z_t, \theta)\|_2 \le \Delta_\theta < \infty$ is
implied by Assumptions A.5'(b) and A.2(b.i), and that we may take $\eta_{jt} = E\,h_j(Z_t)$. We have the
following result.

COROLLARY 2.3: Given Assumptions A.1-A.3, A.4' and A.5', let $\{\theta_t\}$ be given by (1.1) with
$\theta_0$ chosen arbitrarily. Then the conclusions of Theorem 2.1 hold.     □

This provides general and fairly primitive conditions ensuring the convergence of $\theta_t$. Only
Assumption A.5' is a reasonable candidate for further specialization to achieve additional
simplicity. This is most conveniently done by placing conditions on $h_1$, $h_2$, $h_3$ and $\{Z_t\}$ sufficient
to ensure that the mixingale property is valid. We give examples of this in the next section.

The present result gives a very considerable generalization of a convergence result of
White (1989, Proposition 3.1). There $Z_t$ is taken to be an i.i.d. uniformly bounded sequence.
Corollary 2.3 also generalizes results of Englund, Holst and Ruppert (1988), who assume that
$\{Z_t\}$ is a stationary mixing process and that $\psi$ is a bounded function.

Asymptotic normality follows as a consequence of Theorem 2 of KH. As KH show, the
fastest rate of convergence obtains with $a_t = (t+1)^{-1}$; we adopt this rate for the rest of this
section.

For given $\theta^* \in \mathbb{R}^k$ we write
$U_t = (t+1)^{1/2} (\theta_t - \theta^*)$.
Straightforward manipulations allow us to write

$U_{t+1} = [I_k + (t+1)^{-1} H_t]\, U_t + (t+1)^{-1/2} \eta_t^*$,     (2.4)

where

$H_t = \nabla_\theta \psi_t^* + [((t+2)/(t+1))^{1/2} - 1]\, \nabla_\theta \psi_t^* + I_k/2 + O((t+1)^{-1})\, I_k$
$\qquad + ((t+2)/(t+1))^{1/2} \int_0^1 [\nabla_\theta \psi(Z_t, \theta^* + s(\theta_t - \theta^*)) - \nabla_\theta \psi_t^*]\, ds$,     (2.5)

and

$\eta_t^* = ((t+2)/(t+1))^{1/2}\, \psi_t^*$,

with $\psi_t^* \equiv \psi(Z_t, \theta^*)$, $\nabla_\theta \psi_t^* \equiv \nabla_\theta \psi(Z_t, \theta^*)$. The piecewise constant interpolation of $U_t$ on $[0, \infty)$
with interpolation intervals $\{a_t\}$ is defined as

$U^0(\tau) = U_t$,     $\tau \in [\tau_t, \tau_{t+1})$,

and the leftward shifts are defined as

$U^t(\tau) = U^0(\tau_t + \tau)$,     $\tau \ge 0$.

The asymptotic distribution of $\theta_t$ is found by showing that $U^t(\cdot)$ converges to the solution of a
stochastic differential equation (SDE) and characterizing the weak limit of $U^t(\cdot)$.

We  adopt  the  following  conditions: 

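ASSUMPTION B.1: Assumption A.1 holds and $\{Z_t,\ t = 0, \pm 1, \pm 2, \ldots\}$ is a stationary sequence
on $(\Omega, \mathcal{F}, P)$.

ASSUMPTION B.2:

(a) Assumption A.2(a) holds; and

(b) For each $z \in \mathbb{R}^s$, $\psi(z, \cdot)$ is continuously differentiable such that there exist functions
$\rho_2 : \mathbb{R}_+ \to \mathbb{R}_+$ and $h_4 : \mathbb{R}^s \to \mathbb{R}_+$ such that $\rho_2(u) \to 0$ as $u \to 0$, $h_4$ is measurable-$\mathcal{B}^s$, and for
some $\theta^0$ interior to $\Theta$ and each $(z, \theta)$ in $\mathbb{R}^s \times \Theta^0$, $\Theta^0$ an open neighborhood in $\Theta$ of $\theta^0$,

$|\nabla_\theta \psi(z, \theta) - \nabla_\theta \psi(z, \theta^0)| \le \rho_2(|\theta - \theta^0|)\, h_4(z)$.

ASSUMPTION B.3: There exists $\theta^* \in \mathrm{int}\, \Theta$ such that $\theta^* = \theta^0$ in Assumption B.2, $E\,\psi_t^* = 0$,
$\psi_t^* \in L_6(P)$, $\nabla_\theta \psi_t^* \in L_2(P)$, and the eigenvalues of $H = H^* + I_k/2$ (with $H^* = E(\nabla_\theta \psi_t^*)$) have
negative real parts.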
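ASSUMPTION B.5:

(a) Let $\mathcal{F}^0 = \sigma(Z_t,\ t \le 0)$ and suppose

(i) $\sum_{t=0}^{\infty} \|E(\psi_t^* \mid \mathcal{F}^0)\|_2 < \infty$;

(ii) $\sum_{t=0}^{\infty} \sup_{j \ge 0} \|E(\psi_t^* \psi_{t+j}^{*\prime} \mid \mathcal{F}^0) - \sigma_j\|_2 < \infty$, $\sigma_j \equiv E(\psi_t^* \psi_{t+j}^{*\prime})$;

(b) For some $\eta_4 \in \mathbb{R}$, $\sum_{t=0}^{n} (t+1)^{-1} [h_4(Z_t) - \eta_4]$ converges a.s.-$P$; and

(c) $\sum_{t=0}^{n} (t+1)^{-1} [\nabla_\theta \psi_t^* - H^*]$ and $\sum_{t=0}^{n} (t+1)^{-1} [\,|\nabla_\theta \psi_t^*| - h^*]$ converge a.s.-$P$, where
$h^* \equiv E|\nabla_\theta \psi_t^*|$.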

The stationarity imposed in Assumption B.1 is extremely convenient; without this, the
analysis becomes exceedingly complicated. Assumption B.2(b) imposes a Lipschitz condition
on $\nabla_\theta \psi$ analogous to that of A.2(b.ii) for $\psi$. Assumption B.3 imposes additional moment
conditions and identifies $\theta^*$ as a candidate asymptotically stable equilibrium. As we take
$a_t = (t+1)^{-1}$, there is no analog to Assumption A.4 or A.4'. Finally, Assumption B.5 imposes
some further convergence conditions beyond those of A.5. Assumption B.5(a) restricts the local
fluctuations (quadratic variation) induced by $(t+1)^{-1/2} \eta_t^*$ in (2.4) to be compatible with those of a
Wiener process. Assumption B.5(b,c) (together with B.2) ensures that the effects of the second
term and the last term in (2.5) eventually vanish.

The  asymptotic  normality  result  can  be  stated  as  follows. 

THEOREM 2.4: Suppose Assumptions B.1-B.3 and B.5 hold, and that $\theta_t \to \theta^*$ a.s.-$P$, where
$\{\theta_t\}$ is generated by (1.1) with $\theta_0$ arbitrary, $a_t = (t+1)^{-1}$, and $\theta^*$ is an isolated element of $\Theta^*$.
Then:

(a) $\{U_t\}$ is tight in $\mathbb{R}^k$.

(b) $\Sigma \equiv \sum_{j=-\infty}^{\infty} \sigma_j < \infty$.

(c) $\{U^t(\cdot)\}$ converges weakly to the stationary solution of $dU(\tau) = H U(\tau)\, d\tau + \Sigma^{1/2}\, dW(\tau)$,
where $W(\cdot)$ denotes the standard $k$-variate Wiener process. In particular,
$(t+1)^{1/2} (\theta_t - \theta^*) \to N(0, F^*)$, where $F^* = \int_0^{\infty} \exp[Hs]\, \Sigma \exp[H's]\, ds$ is the unique solution to the
matrix equation $H F^* + F^* H' = -\Sigma$.

(d) If $H^*$ is symmetric, then $F^* = M L M'$, where $M$ is the orthogonal matrix such that
$M \Lambda M' = -H^*$, with $\Lambda$ the diagonal matrix containing the eigenvalues $(\lambda_1, \ldots, \lambda_k)$ of $-H^*$ in
decreasing order, and $L$ has $(i,j)$ element $(\lambda_i + \lambda_j - 1)^{-1} K_{ij}$, where $K = M' \Sigma M$.     □
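The closed form in Theorem 2.4(d) can be checked numerically against the matrix equation in part (c). The sketch below is an illustration only: the $2 \times 2$ inputs (a rotation `M`, eigenvalues `lams` of $-H^*$ chosen above $1/2$, and a covariance `Sigma`) are hypothetical, and the helper names are assumptions. It builds $F^* = M L M'$ and evaluates the residual of $H F^* + F^* H' + \Sigma = 0$ with $H = H^* + I_k/2$.

```python
import math


def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


def asymptotic_covariance(M, lams, Sigma):
    """F* = M L M' with L_ij = K_ij / (lam_i + lam_j - 1) and K = M' Sigma M."""
    K = matmul(matmul(transpose(M), Sigma), M)
    k = len(lams)
    L = [[K[i][j] / (lams[i] + lams[j] - 1.0) for j in range(k)] for i in range(k)]
    return matmul(matmul(M, L), transpose(M))


# Hypothetical inputs: a rotation M, eigenvalues of -H* above 1/2, a covariance Sigma.
angle = 0.3
M = [[math.cos(angle), -math.sin(angle)], [math.sin(angle), math.cos(angle)]]
lams = [3.0, 2.0]
Lam = [[lams[0], 0.0], [0.0, lams[1]]]
Sigma = [[2.0, 0.5], [0.5, 1.0]]

neg_H_star = matmul(matmul(M, Lam), transpose(M))  # M Lam M' = -H*
H = [[-neg_H_star[i][j] + (0.5 if i == j else 0.0) for j in range(2)] for i in range(2)]
F = asymptotic_covariance(M, lams, Sigma)

# Residual of the matrix equation H F* + F* H' + Sigma = 0.
HF = matmul(H, F)
FHt = matmul(F, transpose(H))
residual = [[HF[i][j] + FHt[i][j] + Sigma[i][j] for j in range(2)] for i in range(2)]
max_residual = max(abs(x) for row in residual for x in row)
print(max_residual)
```

The denominators $\lambda_i + \lambda_j - 1$ are positive exactly when every eigenvalue of $-H^*$ exceeds $1/2$, which is the eigenvalue condition on $H$ in Assumption B.3 for symmetric $H^*$.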

If $a_t$ is chosen to be $(t+1)^{-1} A$ (for finite nonsingular $k \times k$ matrix $A$), then the SDE in
Theorem 2.4(c) becomes $dU(\tau) = H U(\tau)\, d\tau + A \Sigma^{1/2}\, dW(\tau)$, and the covariance matrix of the
asymptotic distribution becomes $A F^* A'$. Part (d) gives an alternative expression for the
covariance matrix of the asymptotic distribution, analogous to that given by Fabian (1968).
Despite the assumed stationarity, Theorem 2.4 generalizes previous results in that the random
variables can be unbounded and the measurements can be correlated (cf. Ljung and Soderstrom,
1983, Ch. 4, and Fabian, 1968).

Again,  the  properties  of  mixingales  can  be  exploited  to  verify  the  convergence  conditions. 
We  impose 

ASSUMPTION B.5':

(a) (i) $\{\psi_t^*, \mathcal{F}^t\}$ is a mixingale of size $-2$ with $c_t \le K$ for some $K < \infty$, $t = 1, 2, \ldots$;

(ii) there exists a constant $K < \infty$ and a sequence of real numbers $\{b_t\}$ such that
$\|E(\psi_t^* \psi_{t+j}^{*\prime} \mid \mathcal{F}^0) - \sigma_j\|_2 \le K b_t$ for all $j$, and $\{b_t\}$ is of size $-2$.

(b) $\{h_4(Z_t) - E(h_4(Z_t)), \mathcal{F}^t\}$, $\{\nabla_\theta \psi_t^* - H^*, \mathcal{F}^t\}$ and $\{|\nabla_\theta \psi_t^*| - h^*, \mathcal{F}^t\}$ are mixingales
of size $-1/2$.

We  have  the  following  result. 

COROLLARY 2.5: Suppose Assumptions B.1-B.3 and B.5' hold and that $\theta_t \to \theta^*$ a.s.-$P$,
where $\{\theta_t\}$ is generated by (1.1) with $\theta_0$ arbitrary, $a_t = (t+1)^{-1}$ and $\theta^*$ is an isolated element of
$\Theta^*$. Then the conclusions of Theorem 2.4 hold.     □

This considerably generalizes an analogous result of White (1989, Proposition 4.1) from the i.i.d.
uniformly bounded case to the stationary dependent case. Englund, Holst and Ruppert (1988)
also give a result for i.i.d. observations.

3.     RECURSIVE  NONLINEAR  LEAST  SQUARES  ESTIMATION 

Suppose the nonlinear model $f(X_t, \delta)$ ($f : \mathbb{R}^p \times D \to \mathbb{R}$, $X_t$ a random $p \times 1$ vector,
$\delta \in D \subset \mathbb{R}^k$) is to be used to forecast the random variable $Y_t$. It is common to seek $\delta^*$, a solution
to the problem

$\min_{\delta \in D} E([Y_t - f(X_t, \delta)]^2)$,

and form a forecast $f(X_t, \delta^*)$. The solution $\delta^*$ is also a solution to the problem

$\Psi(\delta) = E(\nabla_\delta f(X_t, \delta)\, [Y_t - f(X_t, \delta)]) = 0$,

where $\nabla_\delta$ is the gradient operator with respect to $\delta$ yielding a $k \times 1$ column vector. The simple
RM algorithm for this problem in nonlinear least squares regression is the algorithm (1.1) with

$\psi(Z_t, \theta) = \nabla_\delta f(X_t, \delta)\, [Y_t - f(X_t, \delta)]$,

where $Z_t = (Y_t, X_t')'$ and $\theta = \delta$. The updating equation is

$\delta_{t+1} = \delta_t + a_t \nabla_\delta f_t\, [Y_t - f_t]$,     (3.1)

where we have written $f_t = f(X_t, \delta_t)$, $\nabla_\delta f_t = \nabla_\delta f(X_t, \delta_t)$. This is known as a "stochastic gradient
method." In this section we consider the properties of this algorithm and two useful variants, the
"quick" and the "modified" RM algorithms.

A  disadvantage  of  the  simple  RM  algorithm  is  that  it  may  converge  very  slowly  (e.g., 
White,  1988).  To  improve  the  speed  of  convergence,  a  natural  modification  is  to  take  an 
approximate  Gauss-Newton  step  at  each  stage.  This  yields  the  modified  RM  algorithm,  also 
known  as  the  "stochastic  Newton  method."  The  algorithm  is  given  by  (1.1)  with 


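$\psi(Z_t, \theta) = [\psi_1(Z_t, \theta)', \psi_2(Z_t, \theta)']'$,

$\psi_1(Z_t, \theta) = \mathrm{vec}\,[\nabla_\delta f(X_t, \delta)\, \nabla_\delta f(X_t, \delta)' - G]$,

$\psi_2(Z_t, \theta) = G^{-1} \nabla_\delta f(X_t, \delta)\, [Y_t - f(X_t, \delta)]$,

where $\theta = ((\mathrm{vec}\, G)', \delta')'$. The updating equations are then

$G_{t+1} = G_t + a_t [\nabla_\delta f_t \nabla_\delta f_t' - G_t]$,     (3.2a)

$\delta_{t+1} = \delta_t + a_t G_{t+1}^{-1} \nabla_\delta f_t\, [Y_t - f_t]$.     (3.2b)

We take $G_0$ to be an arbitrary positive-definite symmetric matrix.

The difficulties of applying this algorithm are: (1) the inversion of $G_{t+1}$ is computationally
demanding, and (2) the updating estimates $G_t$ need not be positive-definite, pointing the
algorithm in the wrong direction.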

The first problem can be solved by use of the rank one updating formula for the matrix
inverse. Let $P_{t+1} = G_{t+1}^{-1}$ and $\lambda_t = (1 - a_t)/a_t$. The modified RM algorithm is algebraically
equivalent to

$P_{t+1} = (\lambda_t a_t)^{-1} [P_t - P_t \nabla_\delta f_t \nabla_\delta f_t' P_t / (\nabla_\delta f_t' P_t \nabla_\delta f_t + \lambda_t)]$,     (3.3a)

$\delta_{t+1} = \delta_t + a_t P_{t+1} \nabla_\delta f_t\, [Y_t - f_t]$,     (3.3b)

cf. Ljung and Soderstrom (1983, Chap. 2 & 3). The choice $P_0 = I_k$ is often convenient.
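The algebraic equivalence between updating $G_t$ and inverting, and updating $P_t = G_t^{-1}$ directly by the rank one formula, can be checked on a small example. The Python sketch below is illustrative only: the $2 \times 2$ matrix `G`, the gradient vector `g`, the step `a` and all function names are assumptions. The rank one form needs $0 < a_t < 1$, since $\lambda_t = (1 - a_t)/a_t$ vanishes at $a_t = 1$.

```python
def inverse_2x2(G):
    """Inverse of a nonsingular 2 x 2 matrix."""
    (a, b), (c, d) = G
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def update_G(G, g, a):
    """(3.2a): G_{t+1} = G_t + a_t (g g' - G_t)."""
    k = len(g)
    return [[G[i][j] + a * (g[i] * g[j] - G[i][j]) for j in range(k)] for i in range(k)]


def update_P(P, g, a):
    """(3.3a): rank-one update of P_t = G_t^{-1}; requires 0 < a < 1."""
    k = len(g)
    lam = (1.0 - a) / a
    Pg = [sum(P[i][j] * g[j] for j in range(k)) for i in range(k)]  # P g
    gP = [sum(g[i] * P[i][j] for i in range(k)) for j in range(k)]  # g' P
    denom = sum(g[i] * Pg[i] for i in range(k)) + lam               # g' P g + lambda
    return [[(P[i][j] - Pg[i] * gP[j] / denom) / (lam * a) for j in range(k)]
            for i in range(k)]


G = [[2.0, 0.5], [0.5, 1.0]]   # hypothetical positive-definite G_t
g = [1.0, -2.0]                # hypothetical gradient of f at time t
a = 0.25                       # hypothetical step size a_t

P_direct = inverse_2x2(update_G(G, g, a))
P_rank_one = update_P(inverse_2x2(G), g, a)
max_gap = max(abs(P_direct[i][j] - P_rank_one[i][j]) for i in range(2) for j in range(2))
print(max_gap)
```

The rank one form replaces a matrix inversion at every step by a few matrix-vector products, which is the computational saving the text refers to.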

To ensure that $\hat{G}_t$ is positive-definite, we may use the following modification of (3.2a):

$\tilde{G}_{t+1} = \hat{G}_t + a_t [\nabla_\delta f_t \nabla_\delta f_t' - \hat{G}_t]$,     (3.4a)

$\hat{G}_{t+1} = \begin{cases} \tilde{G}_{t+1}, & \text{if } \tilde{G}_{t+1} - \varepsilon I \text{ is positive-semidefinite}, \\ \tilde{G}_{t+1} + M_{t+1}(\varepsilon), & \text{otherwise}, \end{cases}$     (3.4b)

where $\varepsilon$ is some predetermined positive number, and $M_{t+1}(\varepsilon)$ is chosen so that $\hat{G}_{t+1} - \varepsilon I$ is
positive-semidefinite. Some practical implementations of this can be found in Ljung and
Soderstrom (1983, Ch. 6). A similar device can be applied to $P_t$. Implementation of this
algorithm will be understood to employ a projection device restricting $\delta_t$ to a compact set $D$ and
$\hat{G}_t$ to a compact convex set $\Gamma$ such that the maximum and minimum eigenvalues of $\hat{G}_t$ lie in a
bounded strictly positive interval.
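One possible choice of the correction term in (3.4b) is a multiple of the identity that lifts the smallest eigenvalue to $\varepsilon$. The Python sketch below shows this for a symmetric $2 \times 2$ matrix. It is an assumption made for illustration, not the device of the paper or of Ljung and Soderstrom, and the function names are hypothetical.

```python
import math


def min_eigenvalue_2x2(G):
    """Smallest eigenvalue of a symmetric 2 x 2 matrix [[a, b], [b, d]]."""
    a, b, d = G[0][0], G[0][1], G[1][1]
    return 0.5 * (a + d) - math.sqrt((0.5 * (a - d)) ** 2 + b * b)


def floor_eigenvalues(G_tilde, eps):
    """Return G_tilde if G_tilde - eps*I is p.s.d.; otherwise add (eps - lam_min) * I."""
    lam_min = min_eigenvalue_2x2(G_tilde)
    if lam_min >= eps:
        return [row[:] for row in G_tilde]
    shift = eps - lam_min  # assumed form of M(eps): a scalar multiple of the identity
    return [[G_tilde[i][j] + (shift if i == j else 0.0) for j in range(2)]
            for i in range(2)]


indefinite = [[1.0, 2.0], [2.0, 1.0]]  # eigenvalues 3 and -1
corrected = floor_eigenvalues(indefinite, eps=0.1)
already_ok = floor_eigenvalues([[2.0, 0.5], [0.5, 1.0]], eps=0.1)
print(min_eigenvalue_2x2(corrected))
```

Shifting by a multiple of the identity keeps the eigenvectors unchanged and moves every eigenvalue up by the same amount, so the corrected matrix has smallest eigenvalue equal to $\varepsilon$.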

A simplification of the modified RM algorithm is to choose $G$ to be a diagonal matrix. In
particular, we take $G = c I_k$, where $c$ is a positive scalar, so that matrix inversion is avoided. This
yields the quick RM algorithm, the algorithm (1.1) with $\psi = [\psi_1, \psi_2']'$, where now

$\psi_1(Z_t, \theta) = \nabla_\delta f(X_t, \delta)' \nabla_\delta f(X_t, \delta) - c$,

$\psi_2(Z_t, \theta) = c^{-1} \nabla_\delta f(X_t, \delta)\, [Y_t - f(X_t, \delta)]$,

so that the updating equations become

$c_{t+1} = c_t + a_t [\nabla_\delta f_t' \nabla_\delta f_t - c_t]$,     (3.5a)

$\delta_{t+1} = \delta_t + a_t c_{t+1}^{-1} \nabla_\delta f_t\, [Y_t - f_t]$.     (3.5b)

The scalar $c_t$ can be easily modified to be positive in a manner analogous to (3.4); we also
restrict $c_t$ to be bounded. The quick RM algorithm is a compromise between the other two algorithms
in that it takes a negative gradient direction with a scaling factor utilizing some local curvature
information. Consequently, the quick algorithm ought to converge more quickly than the simple
algorithm but more slowly than the modified algorithm. When $a_t = (t+1)^{-1}$, the quick algorithm
then reduces to the "quick and dirty" algorithm of Albert and Gardner (1967, Ch. 7).
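A single step of the simple update (3.1) and of the quick update (3.5a)-(3.5b) can be sketched in Python as follows. The model `f` (a single tanh unit), its gradient, the function names and the clipping bounds `c_min` and `c_max` are all assumptions chosen for illustration. The clipping stands in for the positivity and boundedness restrictions on $c_t$ mentioned above.

```python
import math


def f(x, delta):
    """Hypothetical model: a single tanh unit, f(x, delta) = delta[0] * tanh(delta[1] * x)."""
    return delta[0] * math.tanh(delta[1] * x)


def grad_f(x, delta):
    """Gradient of f with respect to delta."""
    h = math.tanh(delta[1] * x)
    return [h, delta[0] * x * (1.0 - h * h)]


def simple_step(delta, y, x, a):
    """(3.1): delta_{t+1} = delta_t + a_t * grad_f_t * (Y_t - f_t)."""
    resid = y - f(x, delta)
    return [d + a * g * resid for d, g in zip(delta, grad_f(x, delta))]


def quick_step(c, delta, y, x, a, c_min=1e-3, c_max=1e3):
    """(3.5a)-(3.5b), with c_{t+1} clipped to [c_min, c_max] (bounds are assumptions)."""
    g = grad_f(x, delta)
    resid = y - f(x, delta)
    c_next = c + a * (sum(gi * gi for gi in g) - c)
    c_next = min(max(c_next, c_min), c_max)
    delta_next = [d + a * gi * resid / c_next for d, gi in zip(delta, g)]
    return c_next, delta_next


delta_simple = simple_step([1.0, 0.0], y=2.0, x=1.0, a=1.0)
c_quick, delta_quick = quick_step(2.0, [1.0, 0.0], y=2.0, x=1.0, a=0.5)
print(delta_simple, c_quick, delta_quick)
```

Both steps move $\delta_t$ along the same direction $\nabla_\delta f_t [Y_t - f_t]$; the quick step only rescales it by $c_{t+1}^{-1}$, the running average of the squared gradient norm.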

It  is  straightforward  to  impose  conditions  ensuring  the  validity  of  all  assumptions  required 
for  the  convergence  results  of  the  preceding  section.  Only  the  mixingale  assumptions  A.5'  and 
B.5'  require  particular  attention.  We  make  use  of  a  convenient  and  fairly  general  class  of 
mixingales,  near  epoch  dependent  (NED)  functions  of  mixing  processes  (Billingsley,  1969, 
McLeish,  1975,  Gallant  and  White,  1988). 

Let $\{V_t\}$ be a stochastic process on $(\Omega, \mathcal{F}, P)$ and define the mixing coefficients

$\phi_m \equiv \sup_\tau \sup_{\{F \in \mathcal{F}_{-\infty}^{\tau},\, G \in \mathcal{F}_{\tau+m}^{\infty} :\, P(F) > 0\}} |P(G \mid F) - P(G)|$,

$\alpha_m \equiv \sup_\tau \sup_{\{F \in \mathcal{F}_{-\infty}^{\tau},\, G \in \mathcal{F}_{\tau+m}^{\infty}\}} |P(G \cap F) - P(G) P(F)|$,

where $\mathcal{F}_\tau^t = \sigma(V_\tau, \ldots, V_t)$. When $\phi_m \to 0$ or $\alpha_m \to 0$ as $m \to \infty$ we say that $\{V_t\}$ is $\phi$-mixing
(uniform mixing) or $\alpha$-mixing (strong mixing). When $\phi_m = O(m^\lambda)$ for some $\lambda < -a$ we say that
$\{V_t\}$ is $\phi$-mixing of size $-a$, and similarly for $\alpha_m$. We use the following definition of near epoch
dependence, where we adopt the notation $E_{t-m}^{t+m}(\cdot) = E(\cdot \mid \mathcal{F}_{t-m}^{t+m})$.

DEFINITION 3.1: Let $\{Z_t\}$ be a sequence of random variables belonging to $L_2(P)$, and let $\{V_t\}$
be a stochastic process on $(\Omega, \mathcal{F}, P)$. Then $\{Z_t\}$ is near epoch dependent (NED) on $\{V_t\}$ of size
$-a$ if $\nu_m \equiv \sup_t \|Z_t - E_{t-m}^{t+m}(Z_t)\|_2$ is of size $-a$.     □

The  following  three  results  make  it  straightforward  to  impose  conditions  sufficing  for 
Assumptions A.5' and B.5'. The first is obtained by following the argument of Theorem 3.1 of
McLeish  (1975).  The  second  simplifies  a  result  of  Andrews  (1989).  The  third  allows  simple 
treatment  of  products  of  NED  sequences. 

PROPOSITION 3.2: Let $\{Z_t \in L_r(P)\}$, $r > 2$, be NED on $\{V_t\}$ of size $-a$, where $\{V_t\}$ is a mixing
sequence with $\phi_m$ of size $-ar/(r-1)$ or $\alpha_m$ of size $-2ar/(r-2)$, $r > 2$. Then $\{Z_t - E(Z_t)\}$ is a
mixingale of size $-a$.     □

PROPOSITION 3.3: Let $\{Z_t\}$ satisfy the conditions of Proposition 3.2. Let $g : \mathbb{R}^s \to \mathbb{R}$ satisfy
a Lipschitz condition, $|g(z_1) - g(z_2)| \le L |z_1 - z_2|$, $L < \infty$, $z_1, z_2 \in \mathbb{R}^s$. Then $\{g(Z_t) \in L_r(P)\}$
is NED on $\{V_t\}$ of size $-a$. If $\{V_t\}$ satisfies the conditions of Proposition 3.2, then
$\{g(Z_t) - E(g(Z_t))\}$ is a mixingale of size $-a$.     □

PROPOSITION 3.4: Let $\{U_t\}$ and $\{W_t\}$ be two sequences NED on $\{V_t\}$ of size $-a$.

(a) If $\sup_t |W_t| \le \Delta < \infty$ and $\sup_t \|U_t\|_4 \le \Delta < \infty$, then $\sup_t \|U_t W_t\|_4 \le \Delta^2$ and $\{U_t W_t\}$ is NED
on $\{V_t\}$ of size $-a/2$.

(b) If $\sup_t \|W_t\|_8 \le \Delta < \infty$ and $\sup_t \|U_t\|_8 \le \Delta < \infty$, then $\sup_t \|U_t W_t\|_4 \le \Delta^2$ and $\{U_t W_t\}$ is NED
on $\{V_t\}$ of size $-a/2$.

(c) If $\sup_t \|U_t\|_8 \le \Delta < \infty$ and $\{V_t\}$ satisfies the conditions of Proposition 3.2, then there exist
$K < \infty$ and a sequence of real numbers $\{b_t\}$ such that $\sup_{j \ge 0} \|E(U_t U_{t+j} \mid \mathcal{F}^0) - E(U_t U_{t+j})\|_2 \le K b_t$
and $b_t$ is of size $-a/2$.     □
Our subsequent results will make use of Proposition 3.4(a), requiring $\sup_t \|Y_t\|_4 \le \Delta$ and a bound
on the elements of $X_t$. Part (b) illustrates use of the Cauchy-Schwarz inequality to relax the
boundedness condition; the price for this is a corresponding strengthening of moment conditions
on $U_t$ (corresponding to $Y_t$). Here we shall adopt boundedness conditions on $X_t$ to minimize
moment conditions placed on $Y_t$ and facilitate verification of the Lipschitz condition of
Proposition 3.3. Part (c) permits verification of Assumption B.5'(a.ii).

We  impose  the  following  conditions. 

ASSUMPTION C.1: Assumption A.1 holds, and $\{Z_t\}$ is NED on $\{V_t\}$ of size $-1$, where
$Z_t = (Y_t, X_t')'$ with $X_t$ bounded and $\sup_t \|Y_t\|_r \le \Delta < \infty$, and $\{V_t\}$ is a mixing sequence on $(\Omega, \mathcal{F}, P)$
with $\phi_m$ of size $-r/(2(r-1))$, or $\alpha_m$ of size $-r/(r-2)$, $r > 4$.

ASSUMPTION C.2: $f : \mathbb{R}^p \times D \to \mathbb{R}$ is jointly measurable, where $D$ is a compact subset of $\mathbb{R}^k$.
For each $x \in \mathbb{R}^p$, $f(x, \cdot)$ is continuously differentiable, and $f(x, \cdot)$ and $\nabla_\delta f(x, \cdot)$ each satisfy a
Lipschitz condition with Lipschitz constants $L_1(x)$ and $L_2(x)$, where $L_1$ and $L_2$ are each Lipschitz
continuous in $x$. For each $\delta \in D$, $f(\cdot, \delta)$ and $\nabla_\delta f(\cdot, \delta)$ each satisfy a Lipschitz condition.

When  bounds  on  Xt  are  undesirable,  either  Proposition  3.4(b)  or  a  result  of  Gallant  and  White 
(1988,  Theorem  4.2)  can  be  applied  instead  of  Propositions  3.3  and  3.4(a)  to  provide  the  needed 
relaxation.  However,  the  cost  of  this  relaxation  is  an  increase  in  the  strength  and  complexity  of 
the  memory  conditions  (see  Kuan,  1989  for  specific  details).  In  many  applications  and  in 
particular  in  our  subsequent  neural  network  application,  it  is  possible  to  place  bounds  on  Xt 
without  essential  loss  of  generality,  thus  permitting  the  consequent  generality  and  simplicity  of 
the  sufficient  memory  conditions  (Assumption  C.l)  and  the  relative  simplicity  of  Assumption 
C.2.  We  give  further  comments  in  the  next  section. 
ASSUMPTION C.3:

(a) For each $\delta$ in $D$, $\Xi_1(\delta) \equiv \lim_{t \to \infty} E(\nabla_\delta f(X_t, \delta)\, \nabla_\delta f(X_t, \delta)')$ exists.

(b) For each $\delta$ in $D$, $\Xi_2(\delta) \equiv \lim_{t \to \infty} E(\nabla_\delta f(X_t, \delta)\, [Y_t - f(X_t, \delta)])$ exists.

Note that $\Xi_1$ and $\Xi_2$ are continuous on $D$ given C.1-C.2 as a consequence of the localized version
of Theorem 16.8(i) of Billingsley (1979).

We  have  the  following  consequence  of  Corollary  2.3. 

COROLLARY 3.5: Given Assumptions C.1-C.3 and A.4', let $\{\hat\theta_t\}$ be given by (3.1), (3.2) or
(3.5) (the simple, modified and quick algorithms respectively) with $\hat\theta_0$ chosen arbitrarily. Then
the conclusions of Theorem 2.1 hold. □

Note  that  Assumption  C.3(a)  is  unnecessary  for  the  simple  algorithm. 

Because $\theta^*$ is an asymptotically stable equilibrium point of $\Psi(\theta)$, $\theta^*$ cannot be a saddle
point. Corollary 3.5 thus shows that the RM estimates $\hat\delta_t$ converge to $\delta^*$, a local minimum of the
limit of mean squared error, $\lim_t E[Y_t - f(X_t, \delta)]^2$. By the Toeplitz lemma, $n^{-1} \sum_{t=1}^{n} E\psi(Z_t, \theta)$
converges to the same limit $\Psi(\theta)$ as $n \to \infty$, so that $\delta^*$ also locally minimizes
$\lim_n n^{-1} \sum_{t=1}^{n} E[Y_t - f(X_t, \delta)]^2$. The criterion functions for on-line and off-line estimation
methods thus coincide, so that the RM estimators tend to the same limit(s) as the nonlinear least
squares estimator (cf. Ljung and Soderstrom, 1983).
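The coincidence of the on-line and off-line limits can be illustrated numerically. The sketch below is only an illustration under stated assumptions: it takes the linear special case $f(x, \delta) = \delta x$ (so that the off-line least squares estimator has a closed form), assumes the simple algorithm has the usual gradient form $\delta_{t+1} = \delta_t + a_t \nabla_\delta f_t [Y_t - f_t]$ with $a_t = (t+1)^{-1}$, and uses a clipped AR(1) regressor as a stand-in for bounded, dependent $X_t$. All names and numerical values are hypothetical.

```python
import random

def simulate(n, delta_true=2.0, seed=0):
    """Generate (x_t, y_t) with a bounded, serially dependent regressor."""
    rng = random.Random(seed)
    x_prev, data = 0.0, []
    for _ in range(n):
        # AR(1) regressor, clipped so that X_t stays bounded (assumed design)
        x = max(-3.0, min(3.0, 0.5 * x_prev + rng.gauss(0.0, 0.866)))
        y = delta_true * x + rng.gauss(0.0, 0.5)
        data.append((x, y))
        x_prev = x
    return data

def simple_rm(data, delta0=0.0):
    """One pass of delta <- delta + a_t * grad_f * (y - f) with a_t = 1/(t+1)."""
    delta = delta0
    for t, (x, y) in enumerate(data, start=1):
        a_t = 1.0 / (t + 1)
        delta += a_t * x * (y - delta * x)  # f = delta * x, grad_f = x
    return delta

def offline_ls(data):
    """Closed-form off-line least squares estimate for the same model."""
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)

data = simulate(20000)
rm_est, ls_est = simple_rm(data), offline_ls(data)
print(rm_est, ls_est)
```

In this design both estimates should settle near the same value after a single pass, which is the sense in which the two criterion functions coincide.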

Corollary  3.5  is  more  general  than  the  i.i.d.  case  treated  by  White  (1989)  and  the  examples 
given  in  Kushner  and  Clark  (1978,  Chap.  2),  as  we  allow  the  data  to  be  moderately  dependent 
and  heterogeneous.  This  result  differs  from  those  of  Metivier  and  Priouret  (1984)  in  that  we 
require  neither  "conditional  independence"  nor  stationarity. 
Corollary 3.5 also generalizes a result of Ruppert (1983). Ruppert assumes that for some $\delta^*$,
$Y_t = f(X_t, \delta^*) + \varepsilon_t$ and that $(X_t, \varepsilon_t)$ is strong mixing of size $-r/(r-2)$, a condition that may fail
when $X_t$ contains lagged $Y_t$, because $Y_t$ need not be mixing when it is generated in this manner,
even when $\varepsilon_t$ and other elements of $X_t$ are mixing. Indeed, this fact partially motivates our usage
of near epoch dependence. Also, we do not require that $Y_t$ be generated in the manner assumed
by Ruppert (i.e., we may be estimating a "misspecified" model). Compared to the result of Ljung
and Soderstrom (1983), we allow more dependence in the data, as the data need not be generated
by a linear filter.

The  modified  RM  algorithm  can  be  identified  with  the  extended  Kalman  filter  for  the 
nonlinear  signal  model 

$$Y_t = f(X_t, \delta_t) + \varepsilon_t$$

$$\delta_t = \delta_0 \text{ for all } t.$$

The Kalman gain is $a_t P_{t+1} \nabla_\delta f_t$. Corollary 3.5 thus provides conditions more general than
previously  available  ensuring  consistency  of  the  filter.  In  particular,  the  model  can  be 
misspecified  and  the  data  can  be  NED  on  some  underlying  mixing  sequence. 
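For a model that is linear in the parameter, the filter interpretation can be checked directly. The following sketch is an illustration, not the paper's algorithm (3.2) verbatim: it assumes the modified recursion has the form $G_{t+1} = G_t + a_t(\nabla_\delta f_t \nabla_\delta f_t' - G_t)$, $\delta_{t+1} = \delta_t + a_t G_{t+1}^{-1} \nabla_\delta f_t [Y_t - f_t]$, takes the scalar case $f(x, \delta) = \delta x$, and indexes the gains as $a_t = (t+1)^{-1}$ for $t = 0, 1, \ldots$ (an indexing assumption). Under these assumptions the recursion reduces to recursive least squares, so its output should match the off-line estimate up to rounding.

```python
import random

def modified_rm_linear(data):
    """Modified RM for f(x, delta) = delta * x with a_t = 1/(t+1), t = 0, 1, ..."""
    g, delta = 0.0, 0.0                 # starting values; a_0 = 1 overwrites both
    for t, (x, y) in enumerate(data):
        a_t = 1.0 / (t + 1)
        g += a_t * (x * x - g)          # running average of grad_f^2
        gain = a_t * x / g              # scalar analogue of the Kalman gain
        delta += gain * (y - delta * x)
    return delta

rng = random.Random(0)
data = []
for _ in range(2000):
    x = rng.gauss(0.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

rm_est = modified_rm_linear(data)
ols_est = sum(x * y for x, y in data) / sum(x * x for x, _ in data)
print(rm_est, ols_est)
```

The identity holds because $(t+1) G_{t+1}$ equals the running sum of squared regressors, so the gain is exactly the recursive least squares gain.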

Because the quick RM algorithm includes Albert and Gardner's quick and dirty algorithm,
Corollary 3.5 directly generalizes their consistency result to the case of dependent observations.

To  obtain  asymptotic  normality  results  for  the  case  of  nonlinear  regression,  we  impose  the 
following  conditions. 

ASSUMPTION D.1: Assumption C.1 holds such that $\{Z_t\}$ is a stationary sequence NED on
$\{V_t\}$ of size $-8$ with $\|Y_t\|_r \le \Delta < \infty$, $r > 8$.

ASSUMPTION D.2: For algorithms (3.1), (3.2), and (3.5) Assumption B.3 holds for
$\theta^* = \delta^*$, $\theta^* = ((\mathrm{vec}\, G^*)', \delta^{*\prime})'$, and $\theta^* = (c^*, \delta^{*\prime})'$, respectively, where $G^* = E(\nabla_\delta f_t^* \nabla_\delta f_t^{*\prime})$ and
$c^* = \mathrm{tr}(G^*)$.

ASSUMPTION D.3: Assumption C.2 holds such that for each $x \in R^p$, $f(x, \cdot)$ is continuously
differentiable of order 2 in a neighborhood of $\delta^*$ such that $\nabla_{\delta\delta} f(x, \cdot)$ satisfies a Lipschitz
condition in a neighborhood of $\delta^*$ with Lipschitz constant $L_3(x)$, where $L_3$ is Lipschitz
continuous in $x$, and $\nabla_{\delta\delta} f(\cdot, \delta^*)$ satisfies a Lipschitz condition.

COROLLARY 3.6: Suppose Assumptions D.1-D.3 hold and that $\hat\theta_t \to \theta^*$ a.s.-P where $\{\hat\theta_t\}$ is
generated by (3.1), (3.2) or (3.5) with $\hat\theta_0$ chosen arbitrarily, $a_t = (t+1)^{-1}$, and $\theta^*$ is an isolated
element of $\Theta^*$. Then the conclusions of Theorem 2.4 hold.


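In particular, for (3.1), $(t+1)^{1/2} (\hat\delta_t - \delta^*) \xrightarrow{d} N(0, F_1^*)$, where

$$F_1^* = \int_0^\infty \exp[H_1 s]\, \Sigma_1 \exp[H_1' s]\, ds, \quad H_1 = H_1^* + I_k/2, \quad H_1^* = E(\nabla_{\delta\delta} f_t^* (Y_t - f_t^*) - \nabla_\delta f_t^* \nabla_\delta f_t^{*\prime}),$$

$$\Sigma_1 = \sum_{j=-\infty}^{\infty} E(\nabla_\delta f_t^*\, \varepsilon_t^* \varepsilon_{t+j}^*\, \nabla_\delta f_{t+j}^{*\prime}),$$

with $f_t^* \equiv f(X_t, \delta^*)$, $\nabla_\delta f_t^* \equiv \nabla_\delta f(X_t, \delta^*)$, $\nabla_{\delta\delta} f_t^* \equiv \nabla_{\delta\delta} f(X_t, \delta^*)$,
$\varepsilon_t^* = Y_t - f_t^*$.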

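For (3.2), $(t+1)^{1/2} (\hat\theta_t - \theta^*) \xrightarrow{d} N(0, F_2^*)$, where

$$F_2^* = \int_0^\infty \exp[H_2 s]\, \Sigma_2 \exp[H_2' s]\, ds, \quad H_2 = H_2^* + I/2,$$

$$H_2^* = E \begin{bmatrix} \cdots & \cdots \\ [\nabla_\delta f_t^{*\prime} (Y_t - f_t^*) \otimes I_k]\, \nabla_G (\mathrm{vec}\, G^{*-1}) & G^{*-1} [\nabla_{\delta\delta} f_t^* (Y_t - f_t^*) - \nabla_\delta f_t^* \nabla_\delta f_t^{*\prime}] \end{bmatrix},$$

$$\Sigma_2 = \sum_{j=-\infty}^{\infty} E \begin{bmatrix} \psi_{1t}^* \psi_{1,t+j}^{*\prime} & \psi_{1t}^* \psi_{2,t+j}^{*\prime} \\ \psi_{2t}^* \psi_{1,t+j}^{*\prime} & \psi_{2t}^* \psi_{2,t+j}^{*\prime} \end{bmatrix},$$

with $\psi_{1t}^* = \mathrm{vec}(\nabla_\delta f_t^* \nabla_\delta f_t^{*\prime} - G^*)$ and $\psi_{2t}^* = G^{*-1} \nabla_\delta f_t^* \varepsilon_t^*$.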
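For (3.3), $(t+1)^{1/2} (\hat\theta_t - \theta^*) \xrightarrow{d} N(0, F_3^*)$, where

$$F_3^* = \int_0^\infty \exp[H_3 s]\, \Sigma_3 \exp[H_3' s]\, ds, \quad H_3 = H_3^* + I/2,$$

$$H_3^* = E \begin{bmatrix} -1 & 2 \nabla_\delta f_t^{*\prime} \nabla_{\delta\delta} f_t^* \\ \cdots & \cdots \end{bmatrix}, \qquad \Sigma_3 = \sum_{j=-\infty}^{\infty} E [\, \cdots \,],$$

with $\cdots$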

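COROLLARY 3.7: Given the conditions of Corollary 3.6, suppose that $f_t^* = E(Y_t \mid X_t, Z_{t-1}, \ldots)$,
and define $\Sigma_0 = E(\varepsilon_t^{*2}\, \nabla_\delta f_t^* \nabla_\delta f_t^{*\prime})$. We have the following results for $\hat\delta_t$.

For (3.1), $(t+1)^{1/2} (\hat\delta_t - \delta^*) \xrightarrow{d} N(0, F_1^*)$, where

$$F_1^* = \int_0^\infty \exp[H_1 s]\, \Sigma_0 \exp[H_1' s]\, ds, \quad H_1 = H_1^* + I_k/2, \quad H_1^* = -G^*.$$

For (3.2), $(t+1)^{1/2} (\hat\delta_t - \delta^*) \xrightarrow{d} N(0, G^{*-1} \Sigma_0 G^{*-1})$.

For (3.3), $(t+1)^{1/2} (\hat\delta_t - \delta^*) \xrightarrow{d} N(0, F_3^*)$, where

$$F_3^* = \int_0^\infty \exp[(-c^{*-1} G^* + I_k/2) s]\, (c^{*-2} \Sigma_0) \exp[(-c^{*-1} G^* + I_k/2) s]\, ds.$$

Further, $F_1^* - G^{*-1} \Sigma_0 G^{*-1}$ and $F_3^* - G^{*-1} \Sigma_0 G^{*-1}$ are positive semidefinite matrices. □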

The results for (3.2) are an extension of those of Ljung and Soderstrom (1983, p. 192).
Note that in Corollary 3.7 the covariance matrix in the correctly specified case
($f_t^* = E(Y_t \mid X_t, Z_{t-1}, \ldots)$) coincides with that of the off-line estimator (cf. White (1989, Proposition
5.1)). Our final conclusions establish the asymptotic efficiency of (3.2) relative to (3.1) and
(3.5).
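The efficiency ranking is easy to see in the scalar case. The sketch below is a worked check under additional assumptions that are not in the text: $k = 1$, $G^* = g > 1/2$, and a homoskedastic error so that $\Sigma_0 = \sigma^2 g$. Then the integral for the simple algorithm evaluates to $\sigma^2 g/(2g - 1)$, the modified algorithm gives $\sigma^2/g$, and the difference is $\sigma^2 (g-1)^2 / (g(2g-1)) \ge 0$, with equality only when $g = 1$. The function names are hypothetical.

```python
def simple_rm_avar(g, sigma2):
    """Scalar F_1 = int_0^inf exp(2*H*s) * Sigma_0 ds, H = -g + 1/2, Sigma_0 = sigma2 * g."""
    h = -g + 0.5
    if h >= 0:
        raise ValueError("need g > 1/2 so that H is negative")
    return sigma2 * g / (-2.0 * h)

def modified_rm_avar(g, sigma2):
    """Scalar G^{-1} Sigma_0 G^{-1} with G = g and Sigma_0 = sigma2 * g."""
    return sigma2 * g / (g * g)

results = []
for g in (0.6, 1.0, 2.0, 5.0):
    f1, f2 = simple_rm_avar(g, 1.0), modified_rm_avar(g, 1.0)
    results.append((g, f1, f2))
    print(g, f1, f2, f1 - f2)
```

The gap widens as $g$ moves away from 1, which is one way to read the advantage of rescaling the gradient by an estimate of $G^*$.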

4.    NEURAL  NETWORK  LEARNING 

White  (1989)  studied  the  simple  RM  algorithm  for  estimating  the  parameters  of  a  class  of 
nonlinear  regression  models  proposed  by  cognitive  scientists,  hidden  layer  feedforward 
networks  (Rumelhart,  Hinton  and  Williams,  1986).  In  this  context,  the  simple  RM  algorithm  is 
known  as  the  method  of  "back-propagation"  for  "neural  network  learning."  White  (1989) 
considered the case of bounded i.i.d. processes $\{Z_t\}$. In this section, we apply our earlier results
to  study  the  convergence  properties  of  the  simple,  modified  and  quick  RM  algorithms  for 
estimating  the  parameters  of  single  hidden  layer  feedforward  networks  in  the  case  of  dependent 
observations.  We  thus  obtain  results  for  generalizations  of  the  method  of  back-propagation 
useful  for  learning  approximations  to  nonlinear  relationships  among  time  series  processes. 

Specifically, we consider least squares approximation to the regression function
$g(X_t) = E(Y_t \mid X_t)$ using single hidden layer feedforward network models of the form

$$f(x, \delta) = \beta_0 + \sum_{j=1}^{q} \beta_j F(\tilde{x}' \gamma_j), \qquad (4.1)$$

where $\tilde{x} = (1, x')'$, $\delta = (\beta', \gamma')'$, $\beta = (\beta_0, \ldots, \beta_q)'$, $\gamma = (\gamma_1', \ldots, \gamma_q')'$, $q \in \mathbb{N}$, and $F: R \to R$ is a given
function (the "hidden layer activation function") with properties described formally below.
Hornik, Stinchcombe and White (1989a) show that when $F$ is a cumulative distribution function
(c.d.f.), then there exist $q$ sufficiently large and $\delta^*$ such that $f(\cdot, \delta^*)$ provides an arbitrarily
accurate approximation to $g$. Stinchcombe and White (1989) and Hornik, Stinchcombe and
White (1989b) provide alternative conditions on $F$ (e.g., $F$ a density function) establishing
similar approximation properties.
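The network function (4.1) and its gradient with respect to $\delta$, which is the quantity the RM recursions use, can be written out in a few lines. This is a minimal sketch assuming the logistic c.d.f. for $F$; the function names and the flat ordering of the gradient (first $\beta$, then $\gamma_1, \ldots, \gamma_q$) are choices made here for illustration.

```python
import math

def logistic(u):
    """Logistic c.d.f., used here as the hidden layer activation F."""
    return 1.0 / (1.0 + math.exp(-u))

def network(x, beta, gamma):
    """f(x, delta) = beta[0] + sum_j beta[j] * F(x_tilde' gamma_j), x_tilde = (1, x')'."""
    xt = [1.0] + list(x)
    out = beta[0]
    for b_j, g_j in zip(beta[1:], gamma):
        out += b_j * logistic(sum(a * b for a, b in zip(xt, g_j)))
    return out

def network_grad(x, beta, gamma):
    """Gradient of f with respect to delta = (beta', gamma')', as a flat list."""
    xt = [1.0] + list(x)
    acts = [logistic(sum(a * b for a, b in zip(xt, g_j))) for g_j in gamma]
    grad = [1.0] + acts                       # derivatives with respect to beta
    for b_j, a_j in zip(beta[1:], acts):
        slope = b_j * a_j * (1.0 - a_j)       # beta_j * F'(u) for the logistic F
        grad.extend(slope * xi for xi in xt)  # derivatives with respect to gamma_j
    return grad

x = [0.3, -1.2]
beta = [0.5, 1.0, -2.0]                       # q = 2 hidden units
gamma = [[0.1, 0.2, -0.3], [0.4, -0.5, 0.6]]
print(network(x, beta, gamma), network_grad(x, beta, gamma))
```

A finite-difference comparison is a convenient way to confirm the analytic gradient before using it inside a recursion.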

For simplicity, we take $q$ to be fixed and we seek a local solution $\delta^*$ to the problem

$$\min_\delta \left[ E([Y_t - f(X_t, \delta)]^2) = E([Y_t - g(X_t)]^2) + E([g(X_t) - f(X_t, \delta)]^2) \right]$$

by seeking a solution to the problem

$$\Xi_2(\delta) = \lim_{t \to \infty} E(\nabla_\delta f(X_t, \delta) [Y_t - f(X_t, \delta)]) = 0$$

using the simple, modified and quick RM algorithms.

For this we impose appropriate conditions. In particular, we adopt Assumption C.1. The
assumption of uniformly bounded $X_t$ causes no loss of generality in the present context. This is a
consequence of the fact that $E(Y_t \mid X_t) = E(Y_t \mid \tilde{X}_t)$ where $\tilde{X}_{ti} = v(X_{ti})$, $i = 1, \ldots, r$ and $v: R \to [0, 1]$
is a strictly increasing continuous function. If $X_t$ is not uniformly bounded then $\tilde{X}_t$ is, and we
seek an approximation to $g(\tilde{X}_t) = E(Y_t \mid \tilde{X}_t)$. We revert to our original notation in what follows,
with the implicit understanding that $X_t$ has been transformed so that Assumption C.1 holds.
Note, however, that $Y_t$ is not assumed bounded, providing the desired generality.
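A transformation of the kind described above can be sketched as follows. The logistic c.d.f. is used as one possible choice of $v$ (an assumption; any strictly increasing continuous map into the unit interval would serve). Because the map is invertible, the transformed regressor carries the same information as the original one, which is why the conditional expectation is unchanged.

```python
import math

def squash(x):
    """Strictly increasing continuous map from R into (0, 1) (logistic c.d.f.)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # numerically stable branch for negative arguments
    return e / (1.0 + e)

def unsquash(u):
    """Inverse of squash on (0, 1)."""
    return math.log(u / (1.0 - u))

raw = [-8.0, -1.5, 0.0, 0.7, 6.0]
bounded = [squash(x) for x in raw]
recovered = [unsquash(u) for u in bounded]
print(bounded, recovered)
```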

ASSUMPTION E.1: $f: R^p \times D \to R$ is given by (4.1) where $D = B \times \Gamma$, with $B$ and $\Gamma$ compact
subsets of $R^{q+1}$ and $R^{q(p+1)}$ respectively, and with $F: R \to R$ a bounded function continuously
differentiable of order 3.

The  conditions  on  F  are  readily  verified  for  the  logistic  c.d.f.  and  hyperbolic  tangent  "squashers" 
commonly  used  in  neural  network  applications. 

COROLLARY 4.1: Given Assumptions C.1, E.1, C.3 and A.4', let $\{\hat\theta_t\}$ be given by (3.1), (3.2)
or (3.5) (the simple, modified and quick algorithms, respectively) with $\hat\theta_0$ chosen arbitrarily.
Then the conclusions of Theorem 2.1 hold. □

Thus the method of back-propagation and its generalizations converge to a parameter vector
giving a locally mean square optimal approximation to the conditional expectation function
$E(Y_t \mid X_t)$ under general conditions on the stochastic process $\{Z_t\}$. This result considerably
generalizes Theorem 3.2 of White (1989).

For  the  asymptotic  distribution  results,  we  impose  the  following  condition. 

ASSUMPTION F.1: Assumption E.1 holds with $F$ continuously differentiable of order 4.

COROLLARY 4.2: Suppose Assumptions D.1, D.2 and F.1 hold and that $\hat\theta_t \to \theta^*$ a.s.-P where
$\{\hat\theta_t\}$ is generated by (3.1), (3.2) or (3.5) with $\hat\theta_0$ chosen arbitrarily, $a_t = (t+1)^{-1}$, and $\theta^*$ is an
isolated element of $\Theta^*$. Then the conclusions of Theorem 2.4 hold.

In particular, the results for $(t+1)^{1/2} (\hat\delta_t - \delta^*)$ of Corollaries 3.6 and 3.7 hold for the choice
of $f$ given in Assumption F.1. □

This result generalizes Theorem 4.2 of White (1989) for the case of bounded i.i.d. $\{Z_t\}$. We
note also that the modified RM algorithm delivers an estimator with asymptotic covariance
matrix equal to that of the one-step estimator given in Theorem 5.3 of White (1989) provided
that $\{\varepsilon_t^*, F_t = \sigma(\ldots, Z_{t-1}, Z_t, X_{t+1})\}$ is a martingale difference sequence.

5.     SUMMARY  AND  DIRECTIONS  FOR  FURTHER  RESEARCH 

We have applied the results of Kushner and Clark (1978) and Kushner and Hwang (1979)
to establish the consistency and asymptotic normality of the Robbins-Monro recursive
m-estimator under conditions allowing moderate dependence in the underlying stochastic
process $\{Z_t\}$. Our consistency results impose asymptotic stationarity on the expectation
$E(\psi(Z_t, \theta))$; asymptotic normality results impose strict stationarity on $\{Z_t\}$. Our conditions are
chosen not to be the most general possible (see KC and KH for those), but to provide readily
interpretable and/or verifiable conditions without making great sacrifices in generality. We
consider  three  implementations  of  the  RM  procedure  for  nonlinear  regression  as  special  cases, 
and  further  specialize  these  to  study  methods  for  "learning"  in  an  interesting  class  of  neural 
network  models.  As  described  in  previous  sections,  these  applications  generalize  available 
results  in  a  number  of  ways.  In  particular,  we  point  out  that  quick  and  modified  RM  procedures 
for  neural  network  learning  provide  useful  generalizations  of  the  method  of  back-propagation 
that  may  have  improved  convergence  properties. 

There  are  many  interesting  directions  for  further  research.  In  particular,  the  stationarity 
assumptions  are  inappropriate  for  evolving  systems.  Provided  that  data  can  be  acquired  at  a 
rapid  rate  compared  to  the  evolution  of  the  system,  the  RM  algorithm  can  be  modified  to  give  a 
useful real-time tracking algorithm ($a_t \to a$ as $t \to \infty$, $a > 0$). Results of Gerencser (1986) and
Kushner  and  Hwang  (1981)  are  relevant  for  this  extension. 
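The contrast between a vanishing gain and a gain bounded away from zero can be seen in a small simulation. This is a sketch under assumptions made only for illustration: a linear model $f(x, \delta) = \delta x$, a target parameter drifting linearly in $t$, and the hypothetical values 0.001 for the drift rate and 0.1 for the constant gain.

```python
import random

def track(n, gain, drift=0.001, seed=1):
    """Run delta <- delta + a_t x_t (y_t - delta x_t) while the target drifts."""
    rng = random.Random(seed)
    delta, target = 0.0, 0.0
    for t in range(1, n + 1):
        target = drift * t                    # slowly evolving "true" parameter
        x = rng.uniform(-1.0, 1.0)
        y = target * x + rng.gauss(0.0, 0.2)
        delta += gain(t) * x * (y - delta * x)
    return abs(delta - target)                # tracking error at the final date

err_decreasing = track(5000, lambda t: 1.0 / (t + 1))
err_constant = track(5000, lambda t: 0.1)
print(err_decreasing, err_constant)
```

With the decreasing gain the estimate effectively averages over the whole history and lags far behind the moving target, whereas the constant gain keeps a short memory and follows it, at the cost of never converging.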

Application  of  the  present  results  to  robust  m-estimators  will  yield  estimation  and  tracking 
procedures  less  sensitive  to  outliers  and  gross  data  errors  than  the  least  squares  estimators 
considered here. For many choices of $\psi$, the analysis parallels that for the least squares case
rather  closely.  These  results  are  within  relatively  easy  reach  for  estimation  procedures. 

For neural network models, it is desirable to relax the assumption that $q$ is fixed. Letting
$q \to \infty$ as the available sample becomes arbitrarily large permits use of neural network models
for purposes of non-parametric estimation. Off-line non-parametric estimation methods for the
case of mixing processes are treated by White (1990) using results for the method of sieves
(Grenander, 1981; White and Wooldridge, 1989). On-line non-parametric estimation methods
appear possible, but will require convergence to a global optimum of the underlying least
squares problem, not just the local optimum that the present methods deliver. Results of
Kushner  (1987)  for  the  method  of  simulated  annealing  provide  hope  that  convergence  to  the 
global  optimum  is  achievable  for  the  case  of  dependent  observations  with  appropriate 
modifications  to  the  RM  procedure. 

Finally,  it  is  of  interest  to  consider  RM  algorithms  for  neural  network  models  that 
generalize  the  feedforward  networks  treated  here  by  allowing  certain  internal  feedbacks.  Such 
"recurrent"  network  models  have  been  considered  by  Jordan  (1986),  Elman  (1988)  and  Williams 
and  Zipser  (1989).  For  example,  in  the  Elman  (1988)  set  up,  hidden  layer  activations  feed  back, 
so that network output is $O_t = F(A_t' \beta)$, $A_{tj} = G(X_t' \gamma_j + A_{t-1}' \delta_j)$, $j = 1, \ldots, q$, where $A_t =
(A_{t0}, A_{t1}, \ldots, A_{tq})'$, $A_{t0} = 1$. This allows for internal network memory and for rich dynamic
behavior of network output. Learning in such models is complicated by the fact that at any stage
of learning, network output depends not only on the entire past history of inputs $X_t$, but also on
the entire past history of estimated parameters $\hat\theta_t$. Results of Kushner and Clark (1978) are
relevant  for  treating  such  internal  feedbacks.  Convergence  of  RM  estimates  in  recurrent 
networks  is  studied  by  Kuan  (1989)  and  Kuan,  Hornik  and  White  (1990). 
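The role of the internal feedback can be made concrete with a forward pass through a network of this type. The sketch below is illustrative only: it assumes the logistic c.d.f. for both $F$ and $G$, an initial hidden state with all feedback activations equal to zero, and hypothetical weight names `gamma` (input weights) and `delta` (feedback weights).

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def elman_outputs(xs, beta, gamma, delta):
    """O_t = F(A_t' beta), A_tj = G(X_t' gamma_j + A_{t-1}' delta_j), with A_t0 = 1."""
    q = len(gamma)
    a_prev = [1.0] + [0.0] * q            # assumed initial hidden state
    outputs = []
    for x in xs:
        a_t = [1.0]
        for g_j, d_j in zip(gamma, delta):
            u = sum(xi * gi for xi, gi in zip(x, g_j))
            u += sum(ai * di for ai, di in zip(a_prev, d_j))
            a_t.append(logistic(u))
        outputs.append(logistic(sum(ai * bi for ai, bi in zip(a_t, beta))))
        a_prev = a_t                      # hidden activations feed back
    return outputs

xs = [[1.0], [0.0], [1.0]]                # the same input at dates 0 and 2
beta = [0.1, 1.0, 1.0]
gamma = [[1.0], [-1.0]]
no_feedback = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
feedback = [[0.0, 0.5, -0.3], [0.0, -0.4, 0.8]]
out_static = elman_outputs(xs, beta, gamma, no_feedback)
out_recurrent = elman_outputs(xs, beta, gamma, feedback)
print(out_static, out_recurrent)
```

Without feedback, identical inputs produce identical outputs; with feedback, the output at a given date also depends on earlier inputs through the hidden state, which is the source of the difficulty described above.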
MATHEMATICAL APPENDIX


All  notation  and  definitions  are  as  given  in  the  text.  We  begin  with  a  result  required  for  the 
proof  of  Theorem  2. 1 . 

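LEMMA A.1: The condition that $\sum_{t=1}^{n} a_t [X_t - y_t]$ converges a.s.-P implies that for some $T > 0$
and each $\varepsilon > 0$,

$$\lim_n P\left[ \sup_{j \ge n} \max_{i \le T} \left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t [X_t - y_t] \right| > \varepsilon \right] = 0, \qquad (a.1)$$

where $m(jT) = \max\{t: \tau_t \le jT\}$ for $T > 0$, $\{a_t\}$ satisfies Assumption A.4, $\{X_t\}$ is a sequence of
random variables, and $\{y_t\}$ is a sequence of bounded real numbers. The condition (a.1) implies
that for each $\varepsilon > 0$,

$$\lim_{\Delta \to 0} \lim_n P\left[ \sup_{j \ge n} \max_{i \le \Delta} \left| \sum_{t=m(j\Delta)}^{m(j\Delta+i)-1} a_t X_t \right| > \varepsilon \right] = 0. \qquad (a.2)$$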

PROOF: We first note that the condition that $\sum_{t=1}^{n} a_t [X_t - y_t]$ converges a.s.-P is equivalent to
the condition that for each $\varepsilon > 0$,

$$\lim_n P\left[ \sup_{j \ge n} \left| \sum_{t=n}^{j} a_t [X_t - y_t] \right| > \varepsilon \right] = 0, \qquad (a.3)$$

see e.g., Lukacs (1975, Theorem 2.4.1). Clearly, (a.1) is implied by (a.3), and the first assertion
follows.

Let $M$ be an upper bound for $\{y_t\}$. Because $\sum_{t=m(jT)}^{m(jT+i)-1} a_t \le T$ for all $i \le T$, we have

$$\max_{i \le T} \left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t y_t \right| \le MT. \qquad (a.4)$$

The second assertion now follows from (a.1) and (a.4) by invoking the triangle inequality and
letting $T \to 0$. □
PROOF OF THEOREM 2.1: We verify the conditions in Theorem 2.4.2 of KC. We first
observe that Assumption A.2(b.ii) simplifies the bound given by [A.2.4.2] of KC, and this
simplification does not affect the validity of the original proof. [A.2.4.2] of KC also requires that
for each $T < \infty$,

$$P\left[ \limsup_{t \to \infty} \int_0^T h_3(Z^0(\tau_t + s))\, ds < \infty \right] = 1, \qquad (a.5)$$

where $Z^0$ is the piecewise constant interpolation of $\{Z_t\}$ with interpolation intervals $\{a_t\}$. It
follows from p. 66 of KC and Lemma A.1 that (a.5) holds given Assumption A.5(b). This
establishes [A.2.4.2] of KC.

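We next show that Assumptions A.3 and A.5(a) imply [A.2.4.3] of KC. We must show that
for each $\theta$,

$$P\left[ \sup_{j \ge n} \max_{i \le T} \left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t [\psi_t(\theta) - \Psi(\theta)] \right| < \varepsilon \right] \to 1 \qquad (a.6)$$

as $n \to \infty$, where $\psi_t(\theta) \equiv \psi(Z_t, \theta)$. Using the triangle inequality we can write

$$\left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t [\psi_t(\theta) - \Psi(\theta)] \right| \le \left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t [\psi_t(\theta) - E\psi_t(\theta)] \right| + \left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t [E\psi_t(\theta) - \Psi(\theta)] \right|. \qquad (a.7)$$

By Assumption A.5(a) and Lemma A.1 we obtain, given $\varepsilon > 0$,

$$P\left[ \sup_{j \ge n} \max_{i \le T} \left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t [\psi_t(\theta) - E\psi_t(\theta)] \right| < \varepsilon/2 \right] \to 1 \qquad (a.8)$$

as $n \to \infty$. We can choose $n$ sufficiently large such that for $j \ge n$ and
$t \ge m(jT)$, $|E\psi_t(\theta) - \Psi(\theta)| < \varepsilon/2T$ by Assumption A.3. Hence,

$$\sup_{j \ge n} \max_{i \le T} \left| \sum_{t=m(jT)}^{m(jT+i)-1} a_t [E\psi_t(\theta) - \Psi(\theta)] \right| < \varepsilon/2 \qquad (a.9)$$

as $n \to \infty$. (a.6) now follows from (a.7)-(a.9).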

We also observe that Assumptions A.2(b.i) and A.5(b) imply [A.2.4.5] of KC by Lemma
A.1, and that all other conditions are directly assumed. Theorem 2.1(a) and (c) now follow from
Theorem 2.4.2 of KC. If $\Theta \subset \mathcal{A}(\theta^*)$, then $\hat\theta_t$ lies in $\mathcal{A}(\theta^*)$ for all $t$. Hence, Theorem 2.1(b)
follows from Theorem 2.1(c).

Finally, we show that cycling between two asymptotically stable equilibria is impossible. It
is easy to see that points in $\Theta^*$ must be isolated. Let $\theta^*$ and $\theta^{**}$ be two isolated points in $\Theta^*$, and
let $N_{\varepsilon_1}$ and $N_{\varepsilon_2}$ be neighborhoods of $\theta^*$ and $\theta^{**}$, respectively, such that $N_{\varepsilon_1} \subset \mathcal{A}(\theta^*)$, $N_{\varepsilon_2} \subset \mathcal{A}(\theta^{**})$,
and $N_{\varepsilon_1} \cap N_{\varepsilon_2} = \emptyset$. If the path of $\hat\theta_t$ cycles between $\theta^*$ and $\theta^{**}$, $\hat\theta_t$ must move from, say, $N_{\varepsilon_1}$ to
$N_{\varepsilon_2}$ infinitely often. Let $\{t_j\}$ be an infinite subsequence of $\{t\}$ such that $\hat\theta_{t_j} \in N_{\varepsilon_1}$. Then $\theta^{t_j}(\cdot)$ is a
subsequence of $\theta^t(\cdot)$ and has limit $\theta(\cdot)$ satisfying the ODE $\dot\theta = \pi[\Psi(\theta)]$. But for every $T$ there is
a $\tau > T$ such that $\theta(\tau) \in N_{\varepsilon_2}$. Hence $\theta(0) \in N_{\varepsilon_1}$ but $\theta(t)$ cannot converge to $\theta^*$ as $t \to \infty$. This
violates the asymptotic stability of $\theta^*$ and proves Theorem 2.1(d). □

PROOF OF COROLLARY 2.3: The result follows from Theorem 2.1 because the square
summability condition of $a_t$ in Assumption A.4' implies $a_t \to 0$ as $t \to \infty$ and Assumption A.5'
implies Assumption A.5 by the mixingale convergence theorem (McLeish, 1975, Corollary
1.8). □

PROOF  OF  THEOREM  2.4:  We  verify  the  conditions  for  Theorem  2  of  KH. 

We first observe that the conditions [A1], [A4], [A7] and [A8] of KH are directly assumed,
and that [A3] of KH is ensured by Assumption B.5(c) and Lemma A.1.

Second, we show that the consequence of [A2] of KH holds under Assumptions B.2(b) and
B.5(b, c). This amounts to showing that the second assertion in Lemma 1 of KH holds. By
Assumption B.2(b) we have

$$\left| \int_0^1 [\nabla_\theta \psi(z, \theta^* + r(\hat\theta_t - \theta^*)) - \nabla_\theta \psi(z, \theta^*)]\, dr \right| \le h_4(z) \int_0^1 \rho_2(|r(\hat\theta_t - \theta^*)|)\, dr. \qquad (a.10)$$

Clearly, the integral on the RHS of (a.10) converges to zero a.s. because $\hat\theta_t \to \theta^*$ a.s. Let $\{\varepsilon_k\}$ be
a sequence of positive real numbers such that $\sum_k \varepsilon_k < \infty$, and let $\{N_k\}$ be a sequence of integers
tending to infinity as $k \to \infty$. Define measurable sets $A_k$, $B_k$, $C_k$, $D_k$ and $F_k$ as:


A*  = 


sup    max 


rSftsfVtf4  [v*»?-jr] 


>J?2 


Bk  = 


sup    max 


zSSS^+ir1^;!- 


* 


>p2 


c>  = 


sup    max 


ZTifiS3",»+ir1  [a<(z,>-*i. 


>e2 


ZX  = 


ajJJp2(i»^i-^)i>fr^«i 


$$F_k = \bigcup_{i=k}^{\infty} (A_i \cup B_i \cup C_i \cup D_i).$$


By Assumptions B.2(b), B.5(b), (c) and the assumption that $\hat\theta_t \to \theta^*$ a.s., we can choose $N_k$ large
enough such that

$$P[A_k] + P[B_k] + P[C_k] + P[D_k] < \varepsilon_k$$


v-l  ^2 


and(r  +  l)  '  <ekfort>Nk.  Thus, 


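$$P\left[ \bigcup_{i=k}^{\infty} (A_i \cup B_i \cup C_i \cup D_i) \right] \le \sum_{i=k}^{\infty} \left[ P\{A_i\} + P\{B_i\} + P\{C_i\} + P\{D_i\} \right]. \qquad (a.11)$$

Consequently,

$$\lim_{k \to \infty} P\{F_k\} = P\left[ \limsup_{i} (A_i \cup B_i \cup C_i \cup D_i) \right] = 0,$$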


where  the  first  equality  follows  from  the  definition  of  Fk,  and  the  second  equality  follows  from 
(a.l  1)  and  the  Borel-Cantelli  Lemma.  In  view  of  (2.5)  we  obtain,  for<u  e  Fk  andj  >  Nk, 

\I?=%f(t+irl[Ht-H]\  <k0el  (a.12) 

where  k0  is  a  constant  and  s  <ek.  (a.  12)  is  precisely  the  result  proven  in  Lemma  1  of  KH. 

We next show that [A6] of KH holds. By Assumption B.1 and B.3 we have that $\{\psi_t^*\}$ is a
stationary sequence and $E|\psi_t^*|^6 < \infty$. We must show that $\Sigma = \sum_{j=-\infty}^{\infty} \sigma_j$ is bounded. Let
$R = \|\psi_t^*\|_2$. Then

$$|\sigma_j| = |E(\psi_t^* \psi_{t+j}^{*\prime})| \le E|\psi_t^* E(\psi_{t+j}^{*\prime} \mid F^t)| \le \|\psi_t^*\|_2\, \|E(\psi_{t+j}^{*\prime} \mid F^t)\|_2 \le R \kappa_j \qquad (a.13)$$

by the Cauchy-Schwarz inequality and the definition of $\kappa_j$. Because $\kappa_j^2 \le E|\psi_{t+j}^*|^2 = R^2$, it
follows from (a.13) that $|\sigma_j| \le R^{3/2} \kappa_j^{1/2}$. Thus $\Sigma$ is bounded, given Assumption B.5(a.i). This
establishes Theorem 2.4(b).

We can proceed as above to show that $\sum_{j=0}^{\infty} |E(\psi_t^* \psi_{t+j}^{*\prime})| < \infty$, and observe that
$\sup_t E|\nabla_\theta \psi_t^*| < \infty$ from Assumption B.3. This establishes [A5] of KH. Theorem 2.4(a) and (c)
now follow from Theorem 1 and Theorem 2 of KH, respectively.

In particular, because Assumption B.3 ensures that all the eigenvalues of $H$ have negative
real parts, the solution $U(\tau)$ of $dU = HU\, d\tau + \Sigma^{1/2}\, dW$ is asymptotically Gaussian, so that
$U(\tau) \xrightarrow{d} N(0, F^*)$, where $F^*$ is the stationary covariance matrix satisfying $HF^* + F^*H' = -\Sigma$ (see
Arnold, 1979, Ch. 8).
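The link between the integral form of $F^*$ and the matrix equation can be verified directly when $H$ is diagonal. The sketch below assumes $H = \mathrm{diag}(h_1, \ldots, h_k)$ with all $h_i < 0$ (a simplifying assumption made here), in which case $\int_0^\infty \exp[Hs]\, \Sigma \exp[H's]\, ds$ has entries $-\Sigma_{ij}/(h_i + h_j)$. The numerical values are hypothetical.

```python
def stationary_cov_diag(h, sigma):
    """F*_ij = -Sigma_ij / (h_i + h_j) for H = diag(h) with all h_i < 0."""
    k = len(h)
    return [[-sigma[i][j] / (h[i] + h[j]) for j in range(k)] for i in range(k)]

def lyapunov_residual(h, f, sigma):
    """Entries of H F + F H' + Sigma, which should all vanish."""
    k = len(h)
    return [[h[i] * f[i][j] + f[i][j] * h[j] + sigma[i][j] for j in range(k)]
            for i in range(k)]

h = [-0.5, -1.5]                      # eigenvalues of H, both negative
sigma = [[2.0, 0.3], [0.3, 1.0]]
f_star = stationary_cov_diag(h, sigma)
residual = lyapunov_residual(h, f_star, sigma)
print(f_star, residual)
```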

It remains to prove Theorem 2.4(d). Based on the definitions of $K$ and $H$ we can write

$$-K = -M' \Sigma M = M'HF^*M + M'F^*H'M$$

$$= M'(H^* + I/2)F^*M + M'F^*(H^* + I/2)'M$$

$$= -\Lambda M'F^*M - M'F^*M\Lambda + M'F^*M. \qquad (a.14)$$

By (a.14), the $(i,j)$th element of $K$ is

$$[K]_{ij} = (\lambda_i + \lambda_j - 1)[M'F^*M]_{ij}.$$

Hence, $L = M'F^*M$ so that $F^* = MLM'$ as asserted. □

PROOF OF COROLLARY 2.5: Only Assumption B.5 needs to be verified. We observe that
Assumption B.5'(b) is a mixingale condition ensuring Assumption B.5(b, c) by the mixingale
convergence theorem. To establish Assumption B.5(a), we see that Assumption B.5'(a.i) ensures
that for $K < \infty$

$$\kappa_j \equiv \|E(\psi_j^* \mid F^0)\|_2 \le K \xi_j, \qquad (a.15)$$

where $\xi_j$ is the mixingale memory coefficient. The fact that $\xi_j$ is of size $-2$ implies that
$\sum_{j=0}^{\infty} \xi_j^{1/2} < \infty$. Hence (a.15) implies $\sum_{j=0}^{\infty} \kappa_j^{1/2} < \infty$.

This establishes Assumption B.5(a.i). Similarly, Assumption B.5'(a.ii) imposes

$$\zeta_t \equiv \sup_{j \ge 0} \|E(\psi_t^* \psi_{t+j}^{*\prime} - E(\psi_t^* \psi_{t+j}^{*\prime}) \mid F^0)\|_2 \le K b_t.$$

That $b_t$ is of size $-2$ ensures that $\sum_t \zeta_t^{1/2} < \infty$. This establishes Assumption B.5(a.ii). □


PROOF OF PROPOSITION 3.2: See Gallant and White (1988, Lemma 3.14). □


PROOF  OF  PROPOSITION  3.3:  See  Andrews  (1989,  Lemma  1).        □ 


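PROOF OF PROPOSITION 3.4: (a) We first observe that

$$E|U_t W_t - E_{t-m}^{t+m}(U_t W_t)|^2 = E\left[ E_{t-m}^{t+m}\left[ |U_t W_t - E_{t-m}^{t+m}(U_t W_t)|^2 \right] \right] \le E\left[ E_{t-m}^{t+m} |U_t W_t - U_{t,m} W_{t,m}|^2 \right],$$

where $U_{t,m} = E_{t-m}^{t+m}(U_t)$ and $W_{t,m} = E_{t-m}^{t+m}(W_t)$. Here we employ the fact that $E_{t-m}^{t+m}(U_t W_t)$ is the best
$L_2$-predictor of $U_t W_t$ among all $F_{t-m}^{t+m}$-measurable functions. Hence,

$$\|U_t W_t - E_{t-m}^{t+m}(U_t W_t)\|_2 \le \|U_t W_t - U_{t,m} W_{t,m}\|_2$$

$$\le \|U_t W_t - U_{t,m} W_t\|_2 + \|U_{t,m} W_t - U_{t,m} W_{t,m}\|_2$$

$$\le \Delta \|U_t - U_{t,m}\|_2 + \|U_{t,m} W_t - U_{t,m} W_{t,m}\|_2. \qquad (a.16)$$

We then observe that

$$E|U_{t,m} W_t - U_{t,m} W_{t,m}|^2 \le E|U_{t,m}|^2 |W_t - W_{t,m}|^2 \le 2\Delta E|U_{t,m}|^2 |W_t - W_{t,m}| \le 2\Delta \|U_{t,m}^2\|_2 \|W_t - W_{t,m}\|_2.$$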
Consequently  (a.  16)  implies 

\\UtWt-Ett^{UtWt)h 


<k\\Ut-Ut>m\\2  +  ^*Wi-Wttm\\2 


and 


where  A'  =  max{A,  ^2AA  },  andvUWm  is  of  size-c/2. 

(b)     We  use  the  generalized  Holder's  inequality  to  write 
E\UttmWt-UttmW,m\2 


so  that 


\\ut.mwt-ut<mwttm\\2 


Similarly, 


WtWt-UttmWtl2ZJ2  d^lU.-U^i? 


Consequently, 
(c) Using the same argument as in (b) we can show
lUtUt+j-Etilr(PtUt+j)h 

< II Ut  Ut  +j  - E{™(Ut)  E\ Vj™{Ut  +j)  ||2 

< | Ut  Ut  +j  -  Ut  E\  +Jj™{Ut  +j) ||2  + 1| UtE\ Vj™{Ut  +j) - E\™{Ut)  E\ #£(tf,  +j) ||2 

£2>fi"  A3/2v*m. 
Hence,  with  £°(  • )  =  E(  •  I  F°),  we  have 
\\E\UtUt+j)-EUtUt+j\\2 

<\\E° E'tir  (Ut  Ut+j)-EUt  Ut+Jh+lE°[Ut  Ut+j-E'ir(Ut  Ul+j)]\\2  ,  (a.17) 

where  s  =  [t/2]  is  the  integer  part  of  til.  By  Jensen's  inequality,  the  second  term  in  (a.  17)  is 
bounded  by 

\\Ut  Ut+j-Elti+'Wt  Ut+J)\\2<Kvy>tS , 

where  A'  is  a  constant.  It  follows  from  Lemma  2.1  of  McLeish  (1975)  and  Lemma  3.14  of 
Gallant  and  White  (1988)  that  the  first  term  in  (a.  17)  is  bounded  by  Kav^{  or  K<p\;^. 
Choosing  bt  =  ajj^f  +vu  [r/21  or k  ~$[ta{  +vu  uu]  gives  ^e  desired  result.        D 

PROOF OF COROLLARY 3.5: We verify the conditions of Corollary 2.3. Because the other
conditions obviously hold, for the simple RM estimates it suffices to show that Assumptions
A.2(b) and A.5' hold. Given Assumption C.2, it is straightforward to verify that $f$ and $\nabla_\delta f$ are
such that $|f(x, \delta)| \le Q_1(x)$ and $|\nabla_\delta f(x, \delta)| \le Q_2(x)$ for all $\delta \in D$ (compact), where $Q_1$ and $Q_2$ are
Lipschitz continuous in $x$. Therefore,

$$|\psi(z, \theta)| = |\nabla_\delta f(x, \delta)[y - f(x, \delta)]| \le Q_2(x)[|y| + Q_1(x)],$$

so that Assumption A.2(b.i) holds for $b(\theta) = 1$, $h_1(z) = 1$, and $h_2(z) = Q_2(x)[|y| + Q_1(x)]$. Also,


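$$|\psi(z, \theta_1) - \psi(z, \theta_2)| = |\nabla_\delta f(x, \delta_1)[y - f(x, \delta_1)] - \nabla_\delta f(x, \delta_2)[y - f(x, \delta_2)]|$$

$$\le |\nabla_\delta f(x, \delta_1) y - \nabla_\delta f(x, \delta_2) y| + |\nabla_\delta f(x, \delta_1) f(x, \delta_1) - \nabla_\delta f(x, \delta_2) f(x, \delta_2)|. \qquad (a.18)$$

It follows from Assumption C.2 that

$$|\nabla_\delta f(x, \delta_1) y - \nabla_\delta f(x, \delta_2) y| \le |y| L_2(x) |\delta_1 - \delta_2|,$$

and

$$|\nabla_\delta f(x, \delta_2) f(x, \delta_2) - \nabla_\delta f(x, \delta_1) f(x, \delta_1)|$$

$$\le |\nabla_\delta f(x, \delta_2) f(x, \delta_2) - \nabla_\delta f(x, \delta_2) f(x, \delta_1)| + |\nabla_\delta f(x, \delta_2) f(x, \delta_1) - \nabla_\delta f(x, \delta_1) f(x, \delta_1)|$$

$$\le |\nabla_\delta f(x, \delta_2)| L_1(x) |\delta_1 - \delta_2| + |f(x, \delta_1)| L_2(x) |\delta_1 - \delta_2|$$

$$\le [Q_2(x) L_1(x) + Q_1(x) L_2(x)]\, |\delta_1 - \delta_2|.$$

Hence (a.18) becomes

$$|\psi(z, \theta_1) - \psi(z, \theta_2)| \le h_3(z) |\delta_1 - \delta_2|,$$

where

$$h_3(z) = |y| L_2(x) + Q_2(x) L_1(x) + Q_1(x) L_2(x).$$

This establishes Assumption A.2(b.ii).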
Because $|y|$, $L_1(x)$, $L_2(x)$, $Q_1(x)$ and $Q_2(x)$ satisfy Lipschitz conditions, Proposition 3.3
ensures that $|Y_t|$, $L_1(X_t)$, $L_2(X_t)$, $Q_1(X_t)$ and $Q_2(X_t)$ are NED on $\{V_t\}$ of size $-1$. Because $X_t$ is
bounded, $Q_1(X_t)$, $Q_2(X_t)$, $L_1(X_t)$ and $L_2(X_t)$ are bounded. Because $\|Y_t\|_4 \le \Delta$, it follows from
Proposition 3.4(a) and Corollary 4.3(a) of Gallant and White (1988) (i.e., sums of random
variables NED of size $-a$ are also NED of size $-a$) that $h_3(Z_t)$ is NED on $\{V_t\}$ of size $-1/2$. The
mixing conditions of Assumption C.1 then ensure that $\{h_3(Z_t) - Eh_3(Z_t)\}$ is a mixingale of size
$-1/2$ by Proposition 3.2. Similarly, $\{h_2(Z_t) - Eh_2(Z_t)\}$ is a mixingale of size $-1/2$, establishing
Assumption A.5'(ii).

We next verify that for each $\theta \in \Theta$, $\{\psi(Z_t, \theta)\}$ is a mixingale of size $-1/2$. Fix $\delta\ (= \theta)$.
Observe that the Lipschitz condition on $f(\cdot, \delta)$ and the conditions on $\{Z_t\}$ imply by Proposition
3.3 that $\{f(X_t, \delta)\}$ is NED on $\{V_t\}$ of size $-1$. The triangle inequality implies that $\{Y_t - f(X_t, \delta)\}$ is
NED on $\{V_t\}$ of size $-1$, and the boundedness of $X_t$, the continuity of $f(\cdot, \delta)$, and the fact that
$\|Y_t\|_4 \le \Delta < \infty$ imply that $\|Y_t - f(X_t, \delta)\|_4 \le \Delta < \infty$. The Lipschitz condition on $\nabla_\delta f(\cdot, \delta)$ and
the conditions on $\{Z_t\}$ imply by Proposition 3.3 that $\{\nabla_\delta f(X_t, \delta)\}$ is also NED on $\{V_t\}$ of size $-1$.
Further, the elements of $\nabla_\delta f(X_t, \delta)$ are bounded, so that by Proposition 3.4(a) $\{\psi(Z_t, \theta) =
\nabla_\delta f(X_t, \delta)[Y_t - f(X_t, \delta)]\}$ is NED on $\{V_t\}$ of size $-1/2$. It follows from Proposition 3.2 that
$\{\nabla_\delta f(X_t, \delta)[Y_t - f(X_t, \delta)]\}$ is a mixingale of size $-1/2$, given the mixing conditions imposed on
$\{V_t\}$ by Assumption C.1. Thus, Assumption A.5'(i) holds, and the result for the simple RM
procedure follows.

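For the modified RM estimates we first note that every element of $G^{-1}$ is bounded above so
that $|G^{-1}| \le \Delta$ for some $\Delta$.

Now,

$$|\psi_2(z, \theta)| = |G^{-1} \nabla_\delta f(x, \delta)[y - f(x, \delta)]| \le |G^{-1}|\, |\nabla_\delta f(x, \delta)[y - f(x, \delta)]| \le \Delta Q_2(x)[|y| + Q_1(x)],$$

and

$$|\psi_1(z, \theta)| = |\mathrm{vec}(\nabla_\delta f(x, \delta) \nabla_\delta f(x, \delta)' - G)|$$

$$\le |\mathrm{vec}(\nabla_\delta f(x, \delta) \nabla_\delta f(x, \delta)')| + |\mathrm{vec}\, G|$$

$$= [\mathrm{tr}(\nabla_\delta f(x, \delta) \nabla_\delta f(x, \delta)' \nabla_\delta f(x, \delta) \nabla_\delta f(x, \delta)')]^{1/2} + |\mathrm{vec}\, G|$$

$$= [\mathrm{tr}(\nabla_\delta f(x, \delta)' \nabla_\delta f(x, \delta) \nabla_\delta f(x, \delta)' \nabla_\delta f(x, \delta))]^{1/2} + |\mathrm{vec}\, G|$$

$$= |\nabla_\delta f(x, \delta)|^2 + |\mathrm{vec}\, G|$$

$$\le [Q_2(x)]^2 + \Delta,$$

where we use the fact that $|\mathrm{vec}\, A| = [\mathrm{tr}(A'A)]^{1/2}$. Hence Assumption A.2(b.i) holds, as

$$|\psi(z, \theta)| = \left| \begin{bmatrix} \psi_1(z, \theta) \\ \psi_2(z, \theta) \end{bmatrix} \right| \le |\psi_1(z, \theta)| + |\psi_2(z, \theta)| \le [Q_2(x)]^2 + \Delta + \Delta Q_2(x)[|y| + Q_1(x)] \equiv h_2(z).$$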


We now establish a mean-value expansion result for $G^{-1}$. Recall that $G$ is restricted to a
convex compact set $\Gamma$, so the mean value theorem applies. A matrix differentiation result shows
that when $G$ is symmetric and nonsingular, $\partial G^{-1}/\partial g_{ij} = -G^{-1}S_{ij}G^{-1}$, where $g_{ij}$ is the $ij$-th
element of $G$ and $S_{ij}$ is a selection matrix whose every element is zero except that the $ij$-th and
$ji$-th elements are one; see Graybill (1983, p. 358). Hence we can write


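$$\frac{\partial(\operatorname{vec} G^{-1})}{\partial g_{ij}} = -\operatorname{vec}\bigl(G^{-1}S_{ij}G^{-1}\bigr),$$

and

$$\operatorname{vec}\bigl(G_1^{-1} - G_2^{-1}\bigr) = \frac{\partial \operatorname{vec} \bar G^{-1}}{\partial \operatorname{vec} G}\,\operatorname{vec}(G_1 - G_2),$$

where $\bar G$ lies between $G_1$ and $G_2$. (A different mean value applies to each element, but we leave
this implicit.) It follows that

$$|G_1^{-1} - G_2^{-1}| \le \bigl|\operatorname{vec}\bigl(G_1^{-1} - G_2^{-1}\bigr)\bigr| \le \Delta\,|\operatorname{vec}(G_1 - G_2)|$$

because $\bar G^{-1}$ is bounded. We then observe that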


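$$\begin{aligned}
|\psi_2(z,\theta_1) - \psi_2(z,\theta_2)|
&= \bigl|G_1^{-1}\nabla_\delta f(x,\delta_1)[y - f(x,\delta_1)] - G_2^{-1}\nabla_\delta f(x,\delta_2)[y - f(x,\delta_2)]\bigr| \\
&\le \bigl|G_1^{-1}\nabla_\delta f(x,\delta_1)[y - f(x,\delta_1)] - G_1^{-1}\nabla_\delta f(x,\delta_2)[y - f(x,\delta_2)]\bigr| \\
&\quad + \bigl|G_1^{-1}\nabla_\delta f(x,\delta_2)[y - f(x,\delta_2)] - G_2^{-1}\nabla_\delta f(x,\delta_2)[y - f(x,\delta_2)]\bigr| \\
&\le \Delta\bigl[\,|y|L_1(x) + Q_2(x)L_1(x) + Q_1(x)L_2(x)\bigr]\,|\delta_1 - \delta_2| + Q_2(x)\bigl[\,|y| + Q_1(x)\bigr]\,|G_1^{-1} - G_2^{-1}| \\
&\le \Delta\bigl[\,|y|L_1(x) + Q_2(x)L_1(x) + Q_1(x)L_2(x)\bigr]\,|\delta_1 - \delta_2| + \Delta\,Q_2(x)\bigl[\,|y| + Q_1(x)\bigr]\,|\operatorname{vec}(G_1 - G_2)| \\
&\le h_3'(z)\,|\theta_1 - \theta_2|,
\end{aligned}$$

where

$$h_3'(z) = \Delta\bigl[\,|y|L_1(x) + Q_2(x)L_1(x) + Q_1(x)L_2(x) + Q_2(x)\bigl(|y| + Q_1(x)\bigr)\bigr],$$

and

$$\begin{aligned}
|\psi_1(z,\theta_1) - \psi_1(z,\theta_2)|
&= \bigl|\operatorname{vec}\bigl[\nabla_\delta f(x,\delta_1)\nabla_\delta f(x,\delta_1)' - G_1\bigr] - \operatorname{vec}\bigl[\nabla_\delta f(x,\delta_2)\nabla_\delta f(x,\delta_2)' - G_2\bigr]\bigr| \\
&\le \bigl|\operatorname{vec}\bigl[\nabla_\delta f(x,\delta_1)\nabla_\delta f(x,\delta_1)'\bigr] - \operatorname{vec}\bigl[\nabla_\delta f(x,\delta_2)\nabla_\delta f(x,\delta_2)'\bigr]\bigr| + |\operatorname{vec}(G_2 - G_1)|. \qquad \text{(a.19)}
\end{aligned}$$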


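The first term of (a.19) is less than

$$\begin{aligned}
&\bigl|\operatorname{vec}\bigl[\nabla_\delta f(x,\delta_1)\nabla_\delta f(x,\delta_1)'\bigr] - \operatorname{vec}\bigl[\nabla_\delta f(x,\delta_1)\nabla_\delta f(x,\delta_2)'\bigr]\bigr| \\
&\quad + \bigl|\operatorname{vec}\bigl[\nabla_\delta f(x,\delta_1)\nabla_\delta f(x,\delta_2)'\bigr] - \operatorname{vec}\bigl[\nabla_\delta f(x,\delta_2)\nabla_\delta f(x,\delta_2)'\bigr]\bigr| \\
&= \bigl|\bigl(I \otimes \nabla_\delta f(x,\delta_1)\bigr)\bigl[\nabla_\delta f(x,\delta_1) - \nabla_\delta f(x,\delta_2)\bigr]\bigr| + \bigl|\bigl(\nabla_\delta f(x,\delta_2) \otimes I\bigr)\bigl[\nabla_\delta f(x,\delta_1) - \nabla_\delta f(x,\delta_2)\bigr]\bigr| \\
&\le \bigl[\,|I \otimes \nabla_\delta f(x,\delta_1)| + |\nabla_\delta f(x,\delta_2) \otimes I|\,\bigr]\,\bigl|\nabla_\delta f(x,\delta_1) - \nabla_\delta f(x,\delta_2)\bigr| \\
&\le \bigl[\,|I \otimes \nabla_\delta f(x,\delta_1)| + |\nabla_\delta f(x,\delta_2) \otimes I|\,\bigr]\,L_2(x)\,|\delta_1 - \delta_2|,
\end{aligned}$$

where we used the fact that $\operatorname{vec}(ABC) = (C' \otimes A)\operatorname{vec} B$. It can be verified that
$|I \otimes \nabla_\delta f(x,\delta_1)| \le |\operatorname{vec}(I \otimes \nabla_\delta f(x,\delta_1))| \le k\,|\nabla_\delta f(x,\delta_1)|$ and $|\nabla_\delta f(x,\delta_2) \otimes I| \le
k\,|\nabla_\delta f(x,\delta_2)|$, where $k$ is the dimension of $\delta$. Thus, (a.19) becomes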


$$|\psi_1(z,\theta_1) - \psi_1(z,\theta_2)| \le 2k\,Q_2(x)L_2(x)\,|\delta_1 - \delta_2| + |\operatorname{vec}(G_2 - G_1)| \le h_3''(z)\,|\theta_1 - \theta_2|,$$

where $h_3''(z) = 2k\,Q_2(x)L_2(x) + 1$. Hence Assumption A.2(b.ii) holds, as

$$|\psi(z,\theta_1) - \psi(z,\theta_2)| = \left|\begin{pmatrix}\psi_1(z,\theta_1) - \psi_1(z,\theta_2)\\ \psi_2(z,\theta_1) - \psi_2(z,\theta_2)\end{pmatrix}\right| \le h_3(z)\,|\theta_1 - \theta_2|,$$

with $h_3(z) = h_3'(z) + h_3''(z)$. Using the same arguments as before we have that $\{h_2(Z_t) - Eh_2(Z_t)\}$,
$\{h_3(Z_t) - Eh_3(Z_t)\}$, and $\{\psi(Z_t, \theta) - E\psi(Z_t, \theta)\}$ are mixingales of size $-1/2$. Hence Assumption
A.5' also holds. This yields the desired results for the modified RM estimates.
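
The matrix differentiation result invoked above, $\partial G^{-1}/\partial g_{ij} = -G^{-1}S_{ij}G^{-1}$ for symmetric nonsingular $G$, can be checked numerically. The following is only a minimal sketch for a $2 \times 2$ example; the matrix `G` and the helper names (`inv2`, `matmul`, `selection`) are illustrative assumptions and do not come from the paper.

```python
# Finite-difference check of dG^{-1}/dg_ij = -G^{-1} S_ij G^{-1}
# for a symmetric 2x2 matrix G. All helper names are hypothetical.

def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def matmul(p, q):
    """Product of two small dense matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def selection(i, j, n=2):
    """S_ij: zero everywhere except ones at positions (i, j) and (j, i)."""
    s = [[0.0] * n for _ in range(n)]
    s[i][j] = 1.0
    s[j][i] = 1.0
    return s

def analytic_derivative(g, i, j):
    """-G^{-1} S_ij G^{-1}."""
    g_inv = inv2(g)
    prod = matmul(matmul(g_inv, selection(i, j)), g_inv)
    return [[-v for v in row] for row in prod]

def numeric_derivative(g, i, j, eps=1e-6):
    """Central difference; g_ij and g_ji move together so G stays symmetric."""
    s = selection(i, j)
    plus = [[g[r][c] + eps * s[r][c] for c in range(2)] for r in range(2)]
    minus = [[g[r][c] - eps * s[r][c] for c in range(2)] for r in range(2)]
    ip, im = inv2(plus), inv2(minus)
    return [[(ip[r][c] - im[r][c]) / (2 * eps) for c in range(2)]
            for r in range(2)]

G = [[2.0, 0.5], [0.5, 1.0]]  # arbitrary symmetric, nonsingular example
max_gap = max(
    abs(analytic_derivative(G, i, j)[r][c] - numeric_derivative(G, i, j)[r][c])
    for (i, j) in [(0, 0), (0, 1), (1, 1)]
    for r in range(2) for c in range(2)
)
print(max_gap)  # expected to be tiny (finite-difference error only)
```

Because $G^{-1}$ is bounded on the compact set $\Gamma$, the same bound carries over to this derivative, which is what delivers the Lipschitz inequality for $G^{-1}$ used in the proof.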

The  conclusions  for  the  quick  RM  estimates  follow  because  the  quick  algorithm  is  a  special 
case  of  the  modified  algorithm.        □ 
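
Two matrix facts carry most of the algebra in the proof above: $|\operatorname{vec} A| = [\operatorname{tr}(A'A)]^{1/2}$ and $\operatorname{vec}(ABC) = (C' \otimes A)\operatorname{vec} B$, the latter applied to rank-one products such as $\nabla_\delta f \nabla_\delta f'$. The sketch below checks both on small arbitrary matrices; the matrices and helper names are illustrative assumptions, not objects from the paper.

```python
import math

def matmul(p, q):
    """Product of two small dense matrices (nested lists)."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def vec(m):
    """Stack the columns of m into a single list."""
    return [m[r][c] for c in range(len(m[0])) for r in range(len(m))]

def kron(p, q):
    """Kronecker product p (x) q."""
    return [[p[i][j] * q[k][l]
             for j in range(len(p[0])) for l in range(len(q[0]))]
            for i in range(len(p)) for k in range(len(q))]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# vec(ABC) = (C' (x) A) vec(B) on arbitrary conformable matrices.
A = [[1.0, 2.0, 0.0], [-1.0, 3.0, 1.0]]
B = [[2.0, -1.0], [0.0, 1.0], [4.0, 0.5]]
C = [[1.0, 2.0], [3.0, -1.0]]
lhs = vec(matmul(matmul(A, B), C))
rhs = matvec(kron(transpose(C), A), vec(B))
kron_gap = max(abs(u - v) for u, v in zip(lhs, rhs))

# Rank-one special case used in the proof: vec(a b') = (I (x) a) b.
a_col = [[1.0], [2.0]]
b_row = [[3.0, -1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
rank_one_gap = max(
    abs(u - v)
    for u, v in zip(vec(matmul(a_col, b_row)),
                    matvec(kron(identity, a_col), vec(b_row)))
)

# |vec A| = [tr(A'A)]^{1/2}.
frob_from_vec = math.sqrt(sum(v * v for v in vec(A)))
AtA = matmul(transpose(A), A)
frob_from_trace = math.sqrt(sum(AtA[i][i] for i in range(len(AtA))))

print(kron_gap, rank_one_gap, abs(frob_from_vec - frob_from_trace))
```

Note that the transpose on $C$ matters: with a non-symmetric `C` the check fails if `transpose(C)` is replaced by `C`.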

PROOF  OF  COROLLARY  3.6:  We  verify  the  conditions  of  Corollary  2.5.  For  the  simple  RM 
estimates  we  need  to  show  that  Assumptions  B.2(b)  and  B.5'  hold.  In  this  case 

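$$\nabla_\theta\psi(z,\theta) = \nabla_\delta\bigl(\nabla_\delta f(x,\delta)[y - f(x,\delta)]\bigr) = \nabla_{\delta\delta} f(x,\delta)[y - f(x,\delta)] - \nabla_\delta f(x,\delta)\nabla_\delta f(x,\delta)',$$

hence for $\theta^o$ in int $\Theta$ and $\theta$ in $\Theta^o$

$$|\nabla_\theta\psi(z,\theta) - \nabla_\theta\psi(z,\theta^o)| = \bigl|\nabla_{\delta\delta} f\,(y - f) - \nabla_\delta f\,\nabla_\delta f' - \nabla_{\delta\delta} f^o\,(y - f^o) + \nabla_\delta f^o\,\nabla_\delta f^{o\prime}\bigr|,$$

where we have written $f = f(x,\delta)$, $f^o = f(x,\delta^o)$, etc. By Assumption D.2, $\theta^o = \delta^o = \delta^*$. Applying
Assumption D.3 we get $|\nabla_{\delta\delta} f\,y - \nabla_{\delta\delta} f^o\,y| \le |y|\,L_3(x)\,|\delta - \delta^o|$, and

$$\begin{aligned}
|(\nabla_{\delta\delta} f)f - (\nabla_{\delta\delta} f^o)f^o| &\le |(\nabla_{\delta\delta} f)f - (\nabla_{\delta\delta} f)f^o| + |(\nabla_{\delta\delta} f)f^o - (\nabla_{\delta\delta} f^o)f^o| \\
&\le |\nabla_{\delta\delta} f|\,L_1(x)\,|\delta - \delta^o| + Q_1(x)L_3(x)\,|\delta - \delta^o| \\
&\le [Q_3(x)L_1(x) + Q_1(x)L_3(x)]\,|\delta - \delta^o|,
\end{aligned}$$

since $|\nabla_{\delta\delta} f| \le Q_3(x)$, with $Q_3$ Lipschitz-continuous in $x$ by straightforward arguments. Further,

$$|\nabla_\delta f\,\nabla_\delta f' - \nabla_\delta f^o\,\nabla_\delta f^{o\prime}| \le 2\,Q_2(x)L_2(x)\,|\delta - \delta^o|,$$

so that

$$|\nabla_\theta\psi(z,\theta) - \nabla_\theta\psi(z,\theta^o)| \le h_4(z)\,|\delta - \delta^o|, \qquad \text{(a.20)}$$

where

$$h_4(z) = |y|\,L_3(x) + |Q_3(x)|\,L_1(x) + Q_1(x)L_3(x) + 2\,Q_2(x)L_2(x).$$

Thus, Assumption B.2(b) holds.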

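Proposition 3.3 again ensures that $Q_1(x)$, $Q_2(x)$, $Q_3(x)$, $L_1(x)$, $L_2(x)$ and $L_3(x)$ are NED on
$\{V_t\}$ of size $-8$. The same arguments as in the proof of Corollary 3.5 ensure that $\{\psi_t^*\}$,
$\{h_4(Z_t) - Eh_4(Z_t)\}$, $\{\nabla_\theta\psi_t^* - H_1^*\}$, and $\{|\nabla_\theta\psi_t^*| - h_1^*\}$ are mixingales of size $-4$. The fact that
$\{\psi_t^*\}$ is a near epoch dependent function of the mixing sequence $\{V_t\}$ of size $-4$ and $\|\psi_t^*\|_8 \le \Delta$
implies by Proposition 3.4(c) that there exist $K < \infty$ and $\{b_t\}$ such that
$\sup_{j \ge 0}\|E^{0}(\psi_t^*\psi_{t+j}^{*\prime}) - E(\psi_t^*\psi_{t+j}^{*\prime})\|_2 \le K b_t$, where $b_t$ is of size $-2$. Hence Assumption B.5' holds,
and the conclusions now follow from Corollary 2.5 with

$$H_1^* = E(\nabla_\theta\psi_t^*) = E(\nabla_{\delta\delta} f_t^*\,\varepsilon_t^*) - E(\nabla_\delta f_t^*\,\nabla_\delta f_t^{*\prime})$$

and

$$\Sigma_1^* = \sum_{j=-\infty}^{\infty} E(\psi_t^*\psi_{t+j}^{*\prime}) = \sum_{j=-\infty}^{\infty} E(\nabla_\delta f_t^*\,\varepsilon_t^*\,\varepsilon_{t+j}^*\,\nabla_\delta f_{t+j}^{*\prime}).$$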
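For the modified RM algorithm we verify Assumption B.2(b) for the $(k^2 + k) \times (k^2 + k)$ matrix

$$\nabla_\theta\psi(z,\theta) =
\begin{bmatrix}
\nabla_G\psi_1(z,\theta) & \nabla_\delta\psi_1(z,\theta) \\
\nabla_G\psi_2(z,\theta) & \nabla_\delta\psi_2(z,\theta)
\end{bmatrix}
=
\begin{bmatrix}
-I_{k^2} & [\nabla_\delta f \otimes I + I \otimes \nabla_\delta f]\,\nabla_{\delta\delta} f \\
[\nabla_\delta f'(y - f) \otimes I]\,\nabla_G(\operatorname{vec} G^{-1}) & G^{-1}[\nabla_{\delta\delta} f\,(y - f) - \nabla_\delta f\,\nabla_\delta f']
\end{bmatrix},
\qquad \text{(a.21)}$$

where $\nabla_G$ stands for the gradient with respect to $\operatorname{vec} G$.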


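As in the proof of Corollary 3.5 we have

$$\begin{aligned}
&\bigl|[\nabla_\delta f \otimes I + I \otimes \nabla_\delta f]\,\nabla_{\delta\delta} f - [\nabla_\delta f^o \otimes I + I \otimes \nabla_\delta f^o]\,\nabla_{\delta\delta} f^o\bigr| \\
&\le \bigl|[\nabla_\delta f \otimes I + I \otimes \nabla_\delta f]\,(\nabla_{\delta\delta} f - \nabla_{\delta\delta} f^o)\bigr|
 + \bigl|\bigl\{[\nabla_\delta f \otimes I + I \otimes \nabla_\delta f] - [\nabla_\delta f^o \otimes I + I \otimes \nabla_\delta f^o]\bigr\}\,\nabla_{\delta\delta} f^o\bigr| \\
&\le 2k\,Q_2(x)L_3(x)\,|\delta - \delta^o| + Q_3(x)\,\bigl|[\nabla_\delta f - \nabla_\delta f^o] \otimes I\bigr| + Q_3(x)\,\bigl|I \otimes [\nabla_\delta f - \nabla_\delta f^o]\bigr| \\
&\le 2k\,Q_2(x)L_3(x)\,|\delta - \delta^o| + 2k\,Q_3(x)L_2(x)\,|\delta - \delta^o| \\
&= h_4'(z)\,|\delta - \delta^o|,
\end{aligned}$$

where

$$h_4'(z) = 2k\,\bigl(Q_2(x)L_3(x) + Q_3(x)L_2(x)\bigr).$$

We also have

$$\begin{aligned}
&\bigl|[\nabla_\delta f'(y - f) \otimes I]\,\nabla_G(\operatorname{vec} G^{-1}) - [\nabla_\delta f^{o\prime}(y - f^o) \otimes I]\,\nabla_G(\operatorname{vec}\,(G^o)^{-1})\bigr| \\
&\le \bigl|[\nabla_\delta f'(y - f) \otimes I - \nabla_\delta f^{o\prime}(y - f^o) \otimes I]\,\nabla_G(\operatorname{vec} G^{-1})\bigr|
 + \bigl|[\nabla_\delta f^{o\prime}(y - f^o) \otimes I]\,\bigl\{\nabla_G(\operatorname{vec} G^{-1}) - \nabla_G(\operatorname{vec}\,(G^o)^{-1})\bigr\}\bigr| \\
&\le k\Delta\,\bigl[\,|y|L_1(x) + Q_2(x)L_1(x) + Q_1(x)L_2(x)\bigr]\,|\delta - \delta^o| + k\Delta\,Q_2(x)\bigl[\,|y| + Q_1(x)\bigr]\,|\operatorname{vec}(G - G^o)| \\
&\le h_4''(z)\,|\theta - \theta^o|,
\end{aligned}$$

where

$$h_4''(z) = k\Delta\,\bigl[\,|y|L_1(x) + Q_2(x)L_1(x) + Q_1(x)L_2(x) + Q_2(x)\bigl(|y| + Q_1(x)\bigr)\bigr].$$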


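We then have

$$\begin{aligned}
&\bigl|G^{-1}[\nabla_{\delta\delta} f\,(y - f) - \nabla_\delta f\,\nabla_\delta f'] - (G^o)^{-1}[\nabla_{\delta\delta} f^o\,(y - f^o) - \nabla_\delta f^o\,\nabla_\delta f^{o\prime}]\bigr| \\
&\le |G^{-1}|\,\bigl|[\nabla_{\delta\delta} f\,(y - f) - \nabla_\delta f\,\nabla_\delta f'] - [\nabla_{\delta\delta} f^o\,(y - f^o) - \nabla_\delta f^o\,\nabla_\delta f^{o\prime}]\bigr| \\
&\quad + \bigl|\bigl(G^{-1} - (G^o)^{-1}\bigr)[\nabla_{\delta\delta} f^o\,(y - f^o) - \nabla_\delta f^o\,\nabla_\delta f^{o\prime}]\bigr|. \qquad \text{(a.22)}
\end{aligned}$$

It follows from (a.20) that the first term in (a.22) is less than

$$\Delta\,\bigl[\,|y|\,L_3(x) + Q_3(x)L_1(x) + Q_1(x)L_3(x) + 2\,Q_2(x)L_2(x)\bigr]\,|\delta - \delta^o|.$$

It can also be verified that the second term in (a.22) is less than

$$|G^{-1} - (G^o)^{-1}|\,\bigl[Q_3(x)\bigl(|y| + Q_1(x)\bigr) + (Q_2(x))^2\bigr] \le \Delta\,\bigl[Q_3(x)\bigl(|y| + Q_1(x)\bigr) + (Q_2(x))^2\bigr]\,|\operatorname{vec}(G - G^o)|.$$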


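Thus (a.22) becomes

$$\bigl|G^{-1}[\nabla_{\delta\delta} f\,(y - f) - \nabla_\delta f\,\nabla_\delta f'] - (G^o)^{-1}[\nabla_{\delta\delta} f^o\,(y - f^o) - \nabla_\delta f^o\,\nabla_\delta f^{o\prime}]\bigr| \le h_4'''(z)\,|\theta - \theta^o|,$$

where

$$h_4'''(z) = \Delta\,\bigl[\,|y|\,L_3(x) + Q_3(x)L_1(x) + Q_1(x)L_3(x) + 2\,Q_2(x)L_2(x) + Q_3(x)\bigl(|y| + Q_1(x)\bigr) + (Q_2(x))^2\bigr].$$

We also note the fact that $|A| \le |\operatorname{vec} A| \le \sum_i\sum_j |a_{ij}|$, where $A$ is a square matrix and $a_{ij}$
are its elements. Combining these results we immediately get

$$|\nabla_\theta\psi(z,\theta) - \nabla_\theta\psi(z,\theta^o)| \le h_4(z)\,|\theta - \theta^o|,$$

where $h_4(z) \ge h_4'(z) + h_4''(z) + h_4'''(z)$. This establishes Assumption B.2(b). All other
conditions can be verified as in the proof for the simple RM algorithm. Thus the asymptotic
distribution result for $\hat\delta_t$ follows from Corollary 2.5 with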

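$$H_2^* = E(\nabla_\theta\psi_t^*),$$

where $\nabla_\theta\psi(z,\theta)$ is given by (a.21), and

$$\Sigma_2^* = \sum_{j=-\infty}^{\infty} E(\psi_t^*\psi_{t+j}^{*\prime}) = \sum_{j=-\infty}^{\infty} E\left[\begin{pmatrix}\psi_{1t}^*\\ \psi_{2t}^*\end{pmatrix}\begin{pmatrix}\psi_{1,t+j}^{*\prime} & \psi_{2,t+j}^{*\prime}\end{pmatrix}\right],$$

where $\psi_{1t}^* = \operatorname{vec}[\nabla_\delta f_t^*\,\nabla_\delta f_t^{*\prime} - G^*]$ and $\psi_{2t}^* = G^{*-1}\nabla_\delta f_t^*\,(Y_t - f_t^*)$.

The proof for the quick RM algorithm is similar; we omit the details. □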
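
The upper right block of (a.21), $\nabla_\delta \operatorname{vec}(\nabla_\delta f\,\nabla_\delta f') = [\nabla_\delta f \otimes I + I \otimes \nabla_\delta f]\,\nabla_{\delta\delta} f$, can be checked by finite differences. The sketch below does this for a hypothetical regression function $f(x,\delta) = \exp(\delta'x)$, chosen only because its gradient and Hessian have simple closed forms; neither this function nor the helper names come from the paper.

```python
import math

def grad_f(x, d):
    """Gradient of the illustrative f(x, d) = exp(d'x) with respect to d."""
    f = math.exp(sum(xi * di for xi, di in zip(x, d)))
    return [f * xi for xi in x]

def hess_f(x, d):
    """Hessian of the same illustrative f with respect to d."""
    f = math.exp(sum(xi * di for xi, di in zip(x, d)))
    return [[f * xi * xj for xj in x] for xi in x]

def vec_outer(g):
    """vec(g g'), stacking columns: entry (c*k + r) equals g[r] * g[c]."""
    k = len(g)
    return [g[r] * g[c] for c in range(k) for r in range(k)]

def analytic_jacobian(x, d):
    """[g (x) I + I (x) g] H, a k^2 x k matrix, with g the gradient and H the Hessian."""
    k = len(d)
    g, h = grad_f(x, d), hess_f(x, d)
    # Row index i*k + r, column col: (g (x) I) gives g[i]*1{r == col},
    # (I (x) g) gives 1{i == col}*g[r].
    m = [[g[i] * (r == col) + (i == col) * g[r] for col in range(k)]
         for i in range(k) for r in range(k)]
    return [[sum(m[row][s] * h[s][col] for s in range(k)) for col in range(k)]
            for row in range(k * k)]

def numeric_jacobian(x, d, eps=1e-6):
    """Central-difference Jacobian of vec(grad grad') with respect to d."""
    k = len(d)
    cols = []
    for j in range(k):
        dp = list(d)
        dm = list(d)
        dp[j] += eps
        dm[j] -= eps
        vp, vm = vec_outer(grad_f(x, dp)), vec_outer(grad_f(x, dm))
        cols.append([(a - b) / (2 * eps) for a, b in zip(vp, vm)])
    return [[cols[j][row] for j in range(k)] for row in range(k * k)]

x = [0.5, -1.0]   # arbitrary regressor value
d = [0.2, 0.3]    # arbitrary parameter value
ja, jn = analytic_jacobian(x, d), numeric_jacobian(x, d)
jac_gap = max(abs(ja[r][c] - jn[r][c]) for r in range(4) for c in range(2))
print(jac_gap)  # expected to be tiny (finite-difference error only)
```

The upper left block needs no check: $\psi_1$ is linear in $\operatorname{vec} G$ with coefficient $-1$, so its derivative is $-I_{k^2}$.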

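PROOF OF COROLLARY 3.7: The result for the simple RM algorithm follows easily from
Corollary 3.6 because

$$\begin{aligned}
H_1^* &= E\{\nabla_{\delta\delta} f_t^*\,(Y_t - f_t^*) - \nabla_\delta f_t^*\,\nabla_\delta f_t^{*\prime}\} \\
&= E\{\nabla_{\delta\delta} f_t^*\,E(Y_t - f_t^* \mid X_t, Z_{t-1}, \ldots) - \nabla_\delta f_t^*\,\nabla_\delta f_t^{*\prime}\} \\
&= E(-\nabla_\delta f_t^*\,\nabla_\delta f_t^{*\prime}) = -G^*.
\end{aligned}$$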


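For the modified RM algorithm, $H_2^*$ becomes

$$H_2^* =
\begin{bmatrix}
-I_{k^2} & E\bigl([\nabla_\delta f_t^* \otimes I_k + I_k \otimes \nabla_\delta f_t^*]\,\nabla_{\delta\delta} f_t^*\bigr) \\
0 & -I_k
\end{bmatrix}.$$

Clearly, all eigenvalues of $H_2^*$ equal $-1$. Hence all eigenvalues of $\bar H_2^* = H_2^* + I/2$ equal $-1/2$ and
satisfy the requirement of Assumption D.2. We also observe that the lower right $k \times k$ block of $\Sigma_2^*$
is

$$\sum_{j=-\infty}^{\infty} E[\psi_{2t}^*\psi_{2,t+j}^{*\prime}] = G^{*-1}\Bigl[\sum_{j=-\infty}^{\infty} E(\nabla_\delta f_t^*\,\varepsilon_t^*\,\varepsilon_{t+j}^*\,\nabla_\delta f_{t+j}^{*\prime})\Bigr]G^{*-1} = G^{*-1}\Sigma_1^*G^{*-1}.$$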


Owing to the block triangularity of $H_2^*$, the lower right $k \times k$ block of $F_2^*$ becomes

$$\int_0^\infty \exp[(-I_k/2)s]\;G^{*-1}\Sigma_1^*G^{*-1}\;\exp[(-I_k/2)s]\,ds
= \int_0^\infty \exp(-s)\,ds\;\,G^{*-1}\Sigma_1^*G^{*-1}
= G^{*-1}\Sigma_1^*G^{*-1},$$

where the first equality follows from the fact that

$$\exp[(-I_k/2)s] = \exp[(-s/2)I_k] = [\exp(-s/2)]\,I_k.$$

This proves that $(t+1)^{1/2}(\hat\delta_t - \delta^*) \xrightarrow{d} N(0,\, G^{*-1}\Sigma_1^*G^{*-1})$.


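For the quick RM algorithm,

$$H_3^* =
\begin{bmatrix}
\cdot & \cdot \\
0 & -c^{*-1}G^*
\end{bmatrix}$$

is also block triangular, and the lower right $k \times k$ block of $\Sigma_3^*$ is

$$c^{*-2}\sum_{j=-\infty}^{\infty} E(\nabla_\delta f_t^*\,\varepsilon_t^*\,\varepsilon_{t+j}^*\,\nabla_\delta f_{t+j}^{*\prime}) = c^{*-2}\,\Sigma_1^*.$$

It follows that the lower right $k \times k$ block of $F_3^*$ is

$$\int_0^\infty \exp[(-c^{*-1}G^* + I_k/2)s]\,(c^{*-2}\Sigma_1^*)\,\exp[(-c^{*-1}G^* + I_k/2)s]\,ds \equiv \bar F_3^*,$$

so that

$$(t+1)^{1/2}(\hat\delta_t - \delta^*) \xrightarrow{d} N(0,\, \bar F_3^*).$$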


We now show that $F_1^* - G^{*-1}\Sigma_1^{o}G^{*-1}$ is a positive semidefinite matrix. From Theorem
2.4(c) we get

$$\begin{aligned}
-\Sigma_1^* &= \bar H_1^*F_1^* + F_1^*\bar H_1^{*\prime} = (H_1^* + I/2)F_1^* + F_1^*(H_1^* + I/2) \\
&= H_1^*F_1^* + F_1^*H_1^* + F_1^*.
\end{aligned}$$


Hence,

$$\begin{aligned}
-(G^*)^{-1}\Sigma_1^*(G^*)^{-1} &= (G^*)^{-1}\{H_1^*F_1^* + F_1^*H_1^* + F_1^*\}(G^*)^{-1} \\
&= -F_1^*(G^*)^{-1} - (G^*)^{-1}F_1^* + (G^*)^{-1}F_1^*(G^*)^{-1},
\end{aligned}$$

so that

$$\begin{aligned}
F_1^* - G^{*-1}\Sigma_1^*G^{*-1} &= F_1^* - F_1^*(G^*)^{-1} - (G^*)^{-1}F_1^* + (G^*)^{-1}F_1^*(G^*)^{-1} \\
&= \bigl[(F_1^*)^{1/2} - (F_1^*)^{1/2}(G^*)^{-1}\bigr]'\,\bigl[(F_1^*)^{1/2} - (F_1^*)^{1/2}(G^*)^{-1}\bigr]
\end{aligned}$$

is positive semidefinite, where $(F_1^*)^{1/2}$ is such that $(F_1^*)^{1/2}(F_1^*)^{1/2} = F_1^*$. But $\Sigma_1^* = \Sigma_1^{o}$ given that
$f_t^* = E(Y_t \mid X_t, Z_{t-1}, \ldots)$, and the result holds.


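It remains to prove that $\bar F_3^* - G^{*-1}\Sigma_1^{o}G^{*-1}$ is a positive semidefinite matrix. It is readily
verified that the lower right $k \times k$ block of

$$-\Sigma_3^* = \bar H_3^*F_3^* + F_3^*\bar H_3^{*\prime}$$

is

$$-c^{*-2}\,\Sigma_1^* = (-c^{*-1}G^* + I_k/2)\,\bar F_3^* + \bar F_3^*\,(-c^{*-1}G^* + I_k/2).$$

The result for the simple RM algorithm shows that $-G^* = H_1^*$. We immediately get

$$\begin{aligned}
-\Sigma_1^* &= (c^*)^2\bigl[(-c^{*-1}G^* + I_k/2)\,\bar F_3^* + \bar F_3^*\,(-c^{*-1}G^* + I_k/2)\bigr] \\
&= (c^*)^2\bigl[(c^{*-1}H_1^* + I_k/2)\,\bar F_3^* + \bar F_3^*\,(c^{*-1}H_1^* + I_k/2)\bigr] \\
&= c^*H_1^*\bar F_3^* + c^*\bar F_3^*H_1^* + c^{*2}\bar F_3^*.
\end{aligned}$$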


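Consequently,

$$\begin{aligned}
\bar F_3^* - G^{*-1}\Sigma_1^*G^{*-1}
&= \bar F_3^* + c^*G^{*-1}H_1^*\bar F_3^*G^{*-1} + c^*G^{*-1}\bar F_3^*H_1^*G^{*-1} + c^{*2}G^{*-1}\bar F_3^*G^{*-1} \\
&= \bar F_3^* - c^*\bar F_3^*G^{*-1} - c^*G^{*-1}\bar F_3^* + c^{*2}G^{*-1}\bar F_3^*G^{*-1} \\
&= \bigl[(\bar F_3^*)^{1/2} - c^*(\bar F_3^*)^{1/2}G^{*-1}\bigr]'\,\bigl[(\bar F_3^*)^{1/2} - c^*(\bar F_3^*)^{1/2}G^{*-1}\bigr]
\end{aligned}$$

is positive semidefinite, where $(\bar F_3^*)^{1/2}$ is such that $(\bar F_3^*)^{1/2}(\bar F_3^*)^{1/2} = \bar F_3^*$. Since $\Sigma_1^* = \Sigma_1^{o}$, the result
holds.    □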
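
The two positive semidefiniteness results in this proof are easiest to see in the scalar case $k = 1$, where the Lyapunov-type equations can be solved by hand. The sketch below assumes scalar values `g` for $G^*$, `sigma2` for $\Sigma_1^*$ and `c` for $c^*$; these numbers and the function names are illustrative assumptions only.

```python
def simple_rm_variance(g, sigma2):
    """Scalar solution F of (h + 1/2) F + F (h + 1/2) = -sigma2 with h = -g.

    Requires g > 1/2 so that h + 1/2 is negative.
    """
    h_bar = -g + 0.5
    return -sigma2 / (2.0 * h_bar)

def quick_rm_variance(g, sigma2, c):
    """Scalar solution F of (-g/c + 1/2) F + F (-g/c + 1/2) = -sigma2 / c**2.

    Requires 2 g > c so that -g/c + 1/2 is negative.
    """
    h_bar = -g / c + 0.5
    return -(sigma2 / c ** 2) / (2.0 * h_bar)

def benchmark_variance(g, sigma2):
    """Scalar version of G^{-1} Sigma G^{-1}."""
    return sigma2 / g ** 2

g, sigma2 = 2.0, 1.5  # arbitrary illustrative values
gap_simple = simple_rm_variance(g, sigma2) - benchmark_variance(g, sigma2)
gaps_quick = {
    c: quick_rm_variance(g, sigma2, c) - benchmark_variance(g, sigma2)
    for c in (0.5, 1.0, 2.0, 3.0)
}
print(gap_simple)   # equals sigma2 * (g - 1)**2 / (g**2 * (2*g - 1))
print(gaps_quick)   # each gap equals sigma2 * (g - c)**2 / (g**2 * c * (2*g - c))
```

In this scalar setting the gap for the quick algorithm vanishes exactly when `c` equals `g`, and the simple algorithm corresponds to `c = 1`, which matches the squared-term factorizations derived above.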


PROOF OF COROLLARY 4.1: Owing to the compactness of the relevant domains, the
special structure of $f$ in (4.1) and the continuous differentiability of $F$, it is straightforward to
verify the domination and Lipschitz conditions required for application of Corollary 3.5.        □


PROOF  OF  COROLLARY  4.2:  Direct  application  of  Corollary  3.6.        □